Hadoop HDFS / HDFS-347

DFS read performance suboptimal when client co-located on nodes with data

    Details

    • Hadoop Flags:
      Reviewed

      Description

      One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.

      After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64-MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode satisfies the single disk-block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write(). In either case, each chunk is read from the disk with its own I/O operation. The result is that the single request for a 64-MB block ends up hitting the disk as over a thousand smaller requests of 64 KB each.

      Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, the 64-KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
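
      To make the chunking concrete, the per-chunk send pattern described above looks roughly like the following (a simplified sketch with illustrative names, not the actual BlockSender code):

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.nio.channels.FileChannel;
        import java.nio.channels.WritableByteChannel;

        // Illustrative only: one transferTo() call per 64-KB chunk means roughly
        // one disk I/O (and one round of socket sends) per chunk.
        class ChunkedSendSketch {
          static void sendBlock(FileInputStream blockIn, WritableByteChannel clientCh) throws IOException {
            final int CHUNK_SIZE = 64 * 1024;                // 64-KB chunk size
            FileChannel blockChannel = blockIn.getChannel();
            long offset = 0;
            long blockLen = blockChannel.size();             // e.g. a 64-MB block
            while (offset < blockLen) {
              long count = Math.min(CHUNK_SIZE, blockLen - offset);
              offset += blockChannel.transferTo(offset, count, clientCh);
            }
          }
        }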

      I'd like to get some feedback on the best way to address this, but I think the answer is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode and marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data, open them directly, and return the data to the caller. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64 KB, which should reduce the number of iops imposed by each read-block operation.
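
      As a rough sketch of the proposed selection logic (all class and helper names below are illustrative stand-ins, not an existing API):

        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.net.InetAddress;
        import java.util.Set;

        // Illustrative only: per located block, check whether one of its replica
        // hosts is this machine; if so, find and open the block file directly,
        // otherwise fall back to the existing streaming read. LocatedBlock,
        // DatanodeInfo, DirectBlockReader, RemoteBlockReader and
        // findLocalBlockFile() are hypothetical names, not real signatures.
        BlockReader chooseReader(LocatedBlock blk, Set<InetAddress> localAddrs) throws IOException {
          for (DatanodeInfo dn : blk.getLocations()) {
            if (localAddrs.contains(InetAddress.getByName(dn.getHost()))) {
              File blockFile = findLocalBlockFile(blk.getBlock());   // search dfs.data.dir via the shared config
              if (blockFile != null) {
                return new DirectBlockReader(new FileInputStream(blockFile));
              }
            }
          }
          return new RemoteBlockReader(blk);   // existing network path
        }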

      1. a.patch
        413 kB
        Tsz Wo Nicholas Sze
      2. 2013-04-01-jenkins.patch
        424 kB
        Colin Patrick McCabe
      3. 2013.02.15.consolidated4.patch
        366 kB
        Colin Patrick McCabe
      4. 2013.01.31.consolidated2.patch
        402 kB
        Colin Patrick McCabe
      5. 2013.01.31.consolidated.patch
        402 kB
        Colin Patrick McCabe
      6. 2013.01.28.design.pdf
        72 kB
        Colin Patrick McCabe
      7. full.patch
        379 kB
        Colin Patrick McCabe
      8. hdfs-347-merge.txt
        357 kB
        Todd Lipcon
      9. hdfs-347-merge.txt
        349 kB
        Todd Lipcon
      10. hdfs-347-merge.txt
        346 kB
        Todd Lipcon
      11. HDFS-347.035.patch
        343 kB
        Colin Patrick McCabe
      12. HDFS-347.033.patch
        321 kB
        Colin Patrick McCabe
      13. HDFS-347.030.patch
        260 kB
        Colin Patrick McCabe
      14. HDFS-347.029.patch
        259 kB
        Colin Patrick McCabe
      15. HDFS-347.027.patch
        261 kB
        Colin Patrick McCabe
      16. HDFS-347.026.patch
        240 kB
        Colin Patrick McCabe
      17. HDFS-347.025.patch
        240 kB
        Colin Patrick McCabe
      18. HDFS-347.024.patch
        240 kB
        Colin Patrick McCabe
      19. HDFS-347.022.patch
        249 kB
        Colin Patrick McCabe
      20. HDFS-347.021.patch
        248 kB
        Colin Patrick McCabe
      21. HDFS-347.020.patch
        245 kB
        Colin Patrick McCabe
      22. HDFS-347.019.patch
        247 kB
        Colin Patrick McCabe
      23. HDFS-347.018.patch2
        246 kB
        Colin Patrick McCabe
      24. HDFS-347.018.clean.patch
        112 kB
        Colin Patrick McCabe
      25. HDFS-347.017.patch
        247 kB
        Colin Patrick McCabe
      26. HDFS-347.017.clean.patch
        113 kB
        Colin Patrick McCabe
      27. HDFS-347.016.patch
        250 kB
        Colin Patrick McCabe
      28. HDFS-347-016_cleaned.patch
        109 kB
        Colin Patrick McCabe
      29. HDFS-347-branch-20-append.txt
        27 kB
        ryan rawson
      30. BlockReaderLocal1.txt
        28 kB
        dhruba borthakur
      31. hdfs-347.png
        16 kB
        Todd Lipcon
      32. all.tsv
        12 kB
        Todd Lipcon
      33. hdfs-347.txt
        26 kB
        Todd Lipcon
      34. local-reads-doc
        14 kB
        Todd Lipcon
      35. HADOOP-4801.3.patch
        15 kB
        George Porter
      36. HADOOP-4801.2.patch
        15 kB
        George Porter
      37. HADOOP-4801.1.patch
        14 kB
        George Porter


          Activity

          Allen Wittenauer added a comment -

          I think there are some security issues here. Let me explain:

          In order for a task to read a block directly, it will need read permission on that block at the filesystem level. This means that HDFS will need to write the block with global read permission, with the same group as all users, or with a group compatible with all users. So, at the Unix level, we now have drives full of blocks that are world readable.

          A task gets run on those machines. How do you prevent the task from reading a block it shouldn't be able to read? The 'real' HDFS permissions are at the API level. Since the API can now be bypassed...

          Raghu Angadi added a comment -

          This is something HDFS should eventually do. I would keep the focus of this jira mainly to making DFSClient access the block data directly.

          Other things, like reading chunks larger than 64 KB from the OS, are orthogonal issues, and I suspect the extra user-to-kernel context switches would have an extremely small impact on CPU (if any, at least on Linux/Intel).

          Since Hadoop security is based on mostly-trusted components acting responsibly, we could make the DFSClient be polite and get permission from the DataNode before accessing data directly. Such a protocol would be necessary anyway, since the DataNode might want to know how many active readers a block has for non-security reasons.

          He Yongqiang added a comment - - edited

          We also see the same problem and hope this can be fixed.
          For reads, maybe we could just tell the DFSClient where the data lies and let it fetch the data itself, instead of streaming the data to the DFSClient.

          George Porter added a comment -

          This patch contains a proof-of-concept implementation of "HDFS Direct-I/O", which is a mechanism for HDFS clients to directly access datablocks residing on the local host. It does this by bypassing the standard HDFS datablock streaming protocol to the local DataNode, and instead locates and opens the raw datablocks directly.

          The changes are mostly contained to DFSClient.java. BlockReader is now an interface, and there are two implementing classes: RemoteBlockReader and DirectBlockReader. RemoteBlockReader is the baseline, working exactly as before. DirectBlockReader instances are created in blockSeekTo(long). A check is made to see if the requested offset resides in a local block on the same machine as the DFSClient. If so, then the DirectBlockReader opens that file and passes through any I/O operations directly to the Java filesystem layer.
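
           Roughly, the shape of the change is as follows (a simplified sketch; the interface in the actual patch has more methods):

             import java.io.IOException;

             // Simplified sketch of the refactored read interface. RemoteBlockReader
             // keeps the existing streaming behaviour; DirectBlockReader implements
             // the same interface by delegating to a FileInputStream opened on the
             // local block file. blockSeekTo() picks one or the other depending on
             // whether the requested offset lives in a block on this host.
             interface BlockReader {
               int read(byte[] buf, int off, int len) throws IOException;
               long skip(long n) throws IOException;
               void close() throws IOException;
             }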

           I performed two main sets of performance tests: a streaming test and a random read test. Both were run on a single machine running the Hadoop trunk code in pseudo-distributed mode (1 datanode, 1 namenode, dfs.replication set to 1, and the default block size).

          In the streaming test, I opened a 1GB file and read it from start to finish into memory.
          Baseline: 8.730 seconds with std. dev of 0.052
          DirectIO: 5.266 seconds with std. dev of 0.116

          In the random test, I opened a 1GB or 4GB file, then performed a set of random reads, by picking a random offset in the file, seeking to that offset, and reading 1KB of data.

          For the 1 GB file
          Baseline with 1024 reads: 861 reads/second
          DirectIO with 1024 reads: 5988 reads/second

          Baseline with 4096 reads: 1065 reads/second
          DirectIO with 4096 reads: 9287 reads/second

          For the 4 GB file
          Baseline with 65,535 reads: 535 reads/second
          DirectIO with 65,535 reads: 17,852 reads/second

          It's hard to tell how much these results are affected by various disk caches, etc., and so I wanted to put this patch out there to get your experiences with it. Obviously local block read performance will only improve application performance to the extent that you read from locally resident disk blocks.

          Your feedback appreciated! Thanks.

          Bryan Duxbury added a comment -

          This sounds super promising. I'll try to take a look at the patch this weekend.

          As far as security concerns above, I think that the answer is that you make the blocks themselves world readable, but you make the metadata about which blocks belong to which files something that the namenode keeps behind authorization. So, true, you could scan all the blocks and read data you're not supposed to be able to, but depending on the amount of data and number of nodes in this system, this is probably both impractical and unlikely to give you very much useful unauthorized data.

          Owen O'Malley added a comment -

          I think this patch is going the wrong way. Giving up the security and sanity of a single data path is a really high cost. I think it is much more important to get the pread performance up than add a second, security-destroying datapath.

          Doug Cutting added a comment -

          > make the blocks themselves world readable, but you make the metadata about which blocks belong to which files something that the namenode keeps behind authorization

          That's insufficient. It may work in some cases, but it cannot be considered secure.

          The micro benchmark is impressive, but it would be good to also find a macro benchmark that benefits significantly from this. For mapreduce jobs whose output size is proportional to input size, the cost of output would still dominate, no?

          Raghu Angadi added a comment -

           It is a good start for a proof of concept. Currently it does not take care of CRCs; I think it should, to be on par with a normal read. In a real implementation, the client would talk to the DataNode about the block location (preferably over the localhost interface).

           Regarding random reads, I don't think reading directly alone explains the results. There is something odd: what could explain performance increasing as the size of the file increases (if anything, it should only decrease)?

          Raghu Angadi added a comment -

           Security is of course important, and there could be other approaches besides just reading over a TCP socket (shared memory and memory-mapped files, for example). In the worst case, this could be a configuration option that only less security-sensitive clusters enable.

           Since security in Hadoop is at a very early stage, I don't think work in this jira should be too attached to it yet. There should be an option to disable this feature on a cluster, of course.

          George Porter added a comment -

           Thanks for the comments. I'm working on a macro benchmark with a much larger dataset now. I think that some of the performance gain as the number of reads increases is due to reads hitting an already warm cache (that would explain why more reads == higher performance, even on larger files). I definitely agree that adding additional datapaths is not desirable. However, if local reads dominate and are the common case, then we may want to look into optimizing that common case. Like I said, I'm not really attached to this approach (it's mostly a hack) but wanted to at least get a sense of what gains might be possible as a point of reference, especially if we up the number of disks per core to the double digits (e.g., 16 disks per core).

           In terms of security, wouldn't the ideal approach be to encrypt each datablock with its own key? The namenode would keep that key as part of the HDFS metadata, and when an appropriately authorized client issues a getBlockLocations(), those keys would be sent back to the client. Then the DataNodes wouldn't have to worry about security at all; that would be enforced in the namenode, a logically centralized place.

          What's nice about that too is that if the DataNodes were to be compromised, hacked, or just decommissioned, you wouldn't have to worry about leftover data floating around out there. This would especially be true in virtual datacenter or datacenter on demand environments where machines are spun up due to peak loads, then released when no longer needed.

          Allen Wittenauer added a comment - - edited

          Do these reads still show up in the audit logs?

          Also:

          > In terms of security, wouldn't the ideal approach be to encrypt each
          > datablock with its own key?

          That doesn't help you with file permissions.

          Doug Cutting added a comment -

          > In terms of security, wouldn't the ideal approach be to encrypt each datablock with its own key?

          That's an interesting idea. We'd want to encrypt it in the client, checksumming the encrypted data, so that the datanode would never see the cleartext. Then only metadata communication need be encrypted by the communications layer, since the data would already be encrypted, providing a performance boost over an encrypted pipe.

          Doug Cutting added a comment -

          > That doesn't help you with file permissions.

          Wouldn't it if the namenode only gives the decryption key to folks with read permission? You might need to encrypt & timestamp the block's key when it's passed to the client, so that only a datanode can use it...

          Allen Wittenauer added a comment -

          Right. You'd basically end up building a separate key for pretty much every file, if not every block....

          stack added a comment -

           Tried installing the patch to see what kind of improvement I'd see at the application level. Had to infer the BlockReader interface since it's not in the patch, but I've messed something up, in that no block is local in my 3-node HDFS cluster with replication of 3, nor in a one-node cluster with replication 1. Let's try and figure it out, George, over in #hbase.

          George Porter added a comment -

          This updated patch (HADOOP-4801.2.patch) includes the BlockReader interface.

          Jonathan Gray added a comment -

           I'm having a similar issue to stack's. It turns out that blocks are actually located within subdir# directories inside the 'current' directory, and the patch looks only in the root 'current' directory, not within the subdirs.

           What are these exactly? Is there a way to turn them off so all blocks are located in the root directory?

          George Porter added a comment -

           Version 3 of this patch fixes an NPE.

          Todd Lipcon added a comment -

          Attaching v1 of a design document for this feature. This does not include a test plan - that will follow once implementation has gone a bit further. Pasting the design doc below as well:

          Design Document: Local Read Optimization

          Problem Definition

          Currently, when the DFS Client is located on the same physical node as the DataNode serving the data, it does not use this knowledge to its advantage. All blocks are read through the same protocol based on a TCP connection. Early experimentation has shown that this has a 20-30% overhead when compared with reading the block files directly off the local disk.

          This JIRA seeks to improve the performance of node-local reads by providing a fast path that is enabled in this case. This case is very common, especially in the context of MapReduce jobs where tasks are scheduled local to their data.

          Although writes are likely to see an improvement here too, this JIRA will focus only on the read path. The write path is significantly more complicated due to write pipeline recovery, append support, etc. Additionally, the write path will still have to go over TCP to the non-local replicas, so the throughput improvements will probably not be as marked.

          Use Cases

          1. As mentioned above, the majority of data read during a MapReduce job tends to be from local datanodes. This optimization should improve MapReduce performance of read-constrained jobs significantly.
          2. Random reads should see a significant performance benefit with this patch as well. Applications such as the HBase Region Server should see a very large improvement.

          Users will not have to make any specific changes to use the performance improvement - the optimization should be transparent and retain all existing semantics.

          Interaction with Current System

          This behavior needs modifications in two areas:

          DataNode

          The datanode needs to be extended to provide access to the local block storage to the reading client.

          DFSInputStream

          DFSInputStream needs to be extended in order to enable the fast read path when reading from local datanodes.

          Requirements

          Unix Domain Sockets via JNI

          In order to maintain security, we cannot simply have the reader access blocks through the local filesystem. The reader may be running as an arbitrary user ID, and we should not require world-readable permissions on the block storage.

          Unix domain sockets offer the ability to transport already-open file descriptors from one peer to another using the "ancillary data" construct and the sendmsg(2) system call. This ability is documented in unix(7) under the SCM_RIGHTS section.

          Unix domain sockets are unfortunately not available in Java. We will need to employ JNI to access the appropriate system calls.
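
           For concreteness, the Java-facing side of that JNI layer might look roughly like this (hypothetical names; the native implementations would wrap sendmsg(2)/recvmsg(2) with SCM_RIGHTS control messages):

             import java.io.FileDescriptor;
             import java.io.IOException;

             // Hypothetical sketch only; not the actual classes ported from Android.
             final class UnixSocketNative {
               static { System.loadLibrary("hadoop"); }   // JNI code assumed to live in libhadoop

               // Send an already-open file descriptor over a connected unix domain
               // socket; the native side attaches it as SCM_RIGHTS ancillary data.
               static native void sendFileDescriptor(int socketFd, FileDescriptor fd) throws IOException;

               // Receive a file descriptor sent by the peer over the unix domain socket.
               static native FileDescriptor receiveFileDescriptor(int socketFd) throws IOException;
             }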

          Modify DFSClient/DataNode interaction

          The DFS Client will need to be able to initiate the fast path read when it detects it is connecting to a local DataNode. The DataNode needs to respond to this request by providing the appropriate file descriptors or by reverting to the normal slow path if the functionality has been administratively disabled, etc.

          Design

          Unix Domain Sockets in Java

          The Android open source project currently includes support for Unix Domain Sockets in the android.net package. It also includes the native JNI code to implement these classes. Android is Apache 2.0 licensed and thus we can freely use the code in Hadoop.

          The Android project relies on a lot of custom build infrastructure and utility functions. In order to reduce our dependencies, we will copy the appropriate classes into a new org.apache.hadoop.net.unix package. We will include the appropriate JNI code in the existing libhadoop library. If HADOOP-4998 (native runtime library for Hadoop) progresses in the near term, we could include this functionality there.

          The JNI code needs small modifications to work properly in the Hadoop build system without pulling in a large number of Android dependencies.

          Fast path initiation

          When DFSInputStream is connecting to a node, it can determine whether that node is local by simply inspecting the IP address. In the event that it is a local host and the fast path has not been prohibited by the Configuration, the fast path will be initiated. The fast path is simply a different BlockReader implementation.

          Fast path interface

          BlockReader will become an interface, with the current implementation being renamed to RemoteBlockReader. The fast-path for local reads will be a LocalBlockReader, which is instantiated after it has been determined that the target datanode is local.

          Fast path mechanism

          Currently, when the DFSInputStream connects to the DataNode, it sends OP_READ_BLOCK, including the access token, block id, etc. Instead, when the fast path is desired, the client will take the following steps:

          1. Opens a unix socket listening in the in-memory socket namespace. The socket's name will be identical to the clientName already available in the input stream, plus a unique ID for this specific input stream (so that parallel local readers function without collision).
          2. Sends a new opcode OP_CONNECT_UNIX. This operation takes the same parameters as OP_READ_BLOCK, but indicates to the datanode that the client is looking for a local connection.
           3. The datanode performs the same access token and block validity checks as it currently does for the OP_READ_BLOCK case. Thus the security model of the current implementation is retained.
          4. If the datanode refuses for any reason, it responds over the block transceiver protocol with the same error mechanism as the current approach. If the checks pass:
            1. DN connects to the client via the unix socket.
            2. DN opens the block data file and block metadata file
            3. DN extracts the FileDescriptor objects from these InputStreams, and sends them as ancillary data on the unix domain socket. It then closes its side of the unix domain socket.
            4. DN sends an "OK" response via the TCP socket.
            5. If any error happens during this process, it sends back an error response.
          5. On the client side, if an error response is received from the OP_CONNECT_UNIX request, the client will mark a flag indicating that it should no longer try the fast path, and then fall back to the existing BlockReader.
          6. If an OK response is received, the client constructs a LocalBlockReader (LBR).
            1. The LBR reads from the unix domain socket to receive the block data and metadata file descriptors.
            2. At this point, both the TCP socket and the unix socket can be closed; the file descriptors remain valid once they have been received despite any closed sockets.
            3. The LBR then provides the BlockReader interface by simply calling seek(), read(), etc, on an input stream constructed from these file descriptors.
            4. Some refactoring may occur here to try to share checksum verification code between the LocalBlockReader and RemoteBlockReader.

          The reason for the connect-back protocol rather than having the datanode simply listen on a unix socket is to simplify the integration path. In order to listen on a socket, the datanode would need an additional thread to spawn off transceivers. Additionally, it allows for a way to verify that the client is in fact reading from the datanode on the target host/port without relying on some conventional socket path.
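
           In code, the client side of this flow reduces to a try-fast-path-then-fall-back pattern roughly like the one below (names such as tryConnectUnix(), isLocal() and LocalBlockReader are illustrative, and error handling is condensed):

             // Illustrative sketch of steps 1-6 above, not the actual implementation.
             private volatile boolean fastPathDisabled = false;

             BlockReader getBlockReader(DatanodeInfo dn, Block blk) throws IOException {
               if (isLocal(dn) && !fastPathDisabled
                   && conf.getBoolean("dfs.client.use.unix.sockets", false)) {
                 try {
                   // Sends OP_CONNECT_UNIX over TCP, then receives the block data and
                   // metadata file descriptors over the unix domain socket.
                   FileDescriptor[] fds = tryConnectUnix(dn, blk);
                   return new LocalBlockReader(fds[0], fds[1]);
                 } catch (IOException e) {
                   fastPathDisabled = true;   // step 5: stop trying the fast path
                 }
               }
               return new RemoteBlockReader(dn, blk);   // existing slow path
             }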

          DFS Read semantics clarification

          Before embarking on the above, the DFS Read semantics should be clarified. The error handling and retry semantics in the current code are quite unclear. For example, there is significant discussion in HDFS-127 that indicates a lot of confusion about proper behavior.

          Although the work is orthogonal to this patch, it will be quite beneficial to nail down the semantics of the existing implementation before attempting to add onto it. I propose this work be done in a separate JIRA concurrently with discussion on this one, with the two pieces of work to be committed together if possible. This will keep the discussion here on-point and avoid digression into discussion of existing problems like HDFS-127.

          Failure Analysis

          As described above, if any failure or exception occurs during the establishment of the fast path, the system will simply fall back to the existing slow path.

          One issue that is currently unclear is how to handle IOExceptions on the underlying blocks when the read is being performed by the client. See Work Remaining below.

          Security Analysis

          Since the block open() call is still being performed by the datanode, there is no loss of security with this patch. AccessToken checking is performed by the datanode in the same manner as currently exists. Since the blocks can be opened read-only, the recipient of the file descriptors cannot perform unwanted modification.

          Platform Support

          Passing file descriptors over Unix Domain Sockets is supported on Linux, BSD, and Solaris. There may be some differences in the different implementations. The first version of this JIRA should target Linux only, and automatically disable itself on platforms where it will not function correctly. Since this is an optimization and not a new feature (the slow path will continue to be supported) I believe this is OK.

          Work already completed

          Initial experimentation

           The early work in HDFS-347 indicated that the performance improvements of this patch will be substantial. The experimentation modified the BlockReader to "cheat" and simply open the stored blocks with standard file APIs, which had been chmodded world readable. This improved the read of a 1 GB file from 8.7 seconds to 5.3 seconds, and improved random IO performance by a factor of more than 30.

          Local Sockets and JNI Library

          I have already ported the local sockets JNI code from the Android project into a local branch of the Hadoop code base, and written simple unit tests to verify its operation. The JNI code compiles as part of libhadoop, and the Java side uses the existing NativeCodeLoader class. These patches will become part of the Common project.

          DFSInputStream refactoring

          To aid in testing and understanding of the code, I have refactored DFSInputStream to be a standalone class instead of an inner class of DFSClient. Additionally, I have converted BlockReader to an interface and renamed BlockReader to RemoteBlockReader. In the process I also refactored the contents of DFSInputStream to clarify the failure and retry semantics. This work should be migrated to another JIRA as mentioned above.

          Fast path initiation and basic operation

          I have implemented the algorithm as described above and added new unit tests to verify operation. Basic unit tests are currently passing using the fast path reads.

          Work Remaining / Open Questions

          Checksums

          The current implementation of LocalBlockReader does not verify checksums. Thus, some unit tests are not passing. Some refactoring will probably need to be done to share the checksum verification code between LocalBlockReader and RemoteBlockReader.

          IOException handling

           Given that the reads are now occurring directly from the client, we should investigate whether we need to add any mechanism for the client to report errors back to the DFS. The client can still report checksum errors via the existing mechanism, but we may need to add some method by which it can report IO errors (e.g. due to a failing volume). I do not know the current state of volume error tracking in the datanode; some guidance here would be appreciated.

          Interaction with other features (e.g. Append)

          We should investigate whether (and how) this feature will interact with other ongoing work, in particular appends. If there is any complication, it should be straightforward to simply disable the fast path for any blocks currently under construction. Given that the primary benefit for the fast path is in mapreduce jobs, and mapreduce jobs rarely run on under-construction blocks, this seems reasonable and avoids a lot of complexity.

          Timeouts

          Currently, the JNI library has some TODO markings for implementation of timeouts on various socket operations. These will need to be implemented for proper operation.

          Benchmarking

          Given that this is a performance patch, benchmarks of the final implementation should be done, covering both random and sequential IO.

          Statistics/metrics tracking

          Currently, the datanode maintains metrics about the number of bytes read and written. We no longer will have accurate information unless we make reports back from the client. Alternatively, the datanode can use the "length" parameter of OP_READ_UNIX and assume that the client will always read the entirety of data it has requested. This is not a fair assumption, but the approximation may be fine.

          Audit Logs/ClientTrace

          Once the DN has sent a file descriptor for a block to the client, it is impossible to audit the byte offsets that are read. It is possible for a client to request read access to a small byte range of a block, receive a socket, and then proceed to read the entire block. We should investigate whether there is a requirement for byte-range granularity on audit logs and come up with possible solutions (eg disabling fast path for non-whole-block reads).

          File Descriptors held by Zombie Processes

          In practice on some clusters, DFSClient processes can stick around as zombie processes. In the TCP-based DFSClient, these zombie connections are eventually timed out by the DN server. In this proposed JIRA, the file descriptors would be already transferred, and thus would be stuck open on the zombie. This will not block file deletion, but does block the reclaiming of the blocks on the underlying file system. This may cause problems on HDFS instances with a lot of block churn and a bad zombie problem. Dhruba can possibly elaborate here.

          Determining local IPs

          In order to determine when to attempt the fast path, the DFSClient needs to know when it is connecting to a local datanode. This will rarely be a loopback IP address, so we need some way of determining which IPs are actually local. This will probably necessitate an additional method or two in NetUtils in order to inspect the local interface list, with some caching behavior.
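
           A minimal sketch of that check, using only standard java.net APIs (the class name is illustrative and the caching policy is left to the implementation):

             import java.net.InetAddress;
             import java.net.NetworkInterface;
             import java.net.SocketException;
             import java.util.Enumeration;
             import java.util.HashSet;
             import java.util.Set;

             // Enumerate this host's addresses once and cache them, then test each
             // datanode address against the cached set.
             final class LocalAddressCache {
               private static volatile Set<InetAddress> local;

               static boolean isLocal(InetAddress addr) throws SocketException {
                 Set<InetAddress> addrs = local;
                 if (addrs == null) {
                   addrs = new HashSet<InetAddress>();
                   Enumeration<NetworkInterface> ifaces = NetworkInterface.getNetworkInterfaces();
                   while (ifaces.hasMoreElements()) {
                     Enumeration<InetAddress> ias = ifaces.nextElement().getInetAddresses();
                     while (ias.hasMoreElements()) {
                       addrs.add(ias.nextElement());
                     }
                   }
                   local = addrs;   // cache; interface changes would require invalidation
                 }
                 return addr.isLoopbackAddress() || addrs.contains(addr);
               }
             }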

          dhruba borthakur added a comment -

           I like this approach, but there is one thing that is not very clear in my mind... where is the real bottleneck that we are trying to avoid via this proposed mechanism? It appears that we are trying to avoid copying lots of data via the network interface. Is there any alternate way to reduce this cost, maybe using UDP or changing the Ethernet MTU size?

          Todd Lipcon added a comment -

          I ran some tests on my laptop in August where I set the mtu of my loopback interface super-high and it didn't really change DFSIO benchmarks. This was just on my laptop, though, so if someone wanted to reproduce on a real machine that would be helpful.

          Regarding the bottlenecks, here's one microbenchmark that shows that there is some significant overhead in local network connections:

          Both windows on the same machine:

          Window A:
          $ nc -l 1234 > /dev/null
          
          Window B:
          $ time dd if=/dev/zero of=/dev/fd/1 bs=1M count=4000 | nc localhost 1234
          4194304000 bytes (4.2 GB) copied, 23.8092 s, 176 MB/s
          
          real    0m23.818s
          user    0m0.948s
          sys     0m12.841s
          

          versus:

          $ time dd if=/dev/zero of=/dev/fd/1 bs=1M count=4000 | cat > /dev/null
          4000+0 records in
          4000+0 records out
          4194304000 bytes (4.2 GB) copied, 4.69959 s, 892 MB/s
          
          real    0m4.708s
          user    0m0.268s
          sys     0m4.096s
          

          The above is with a jacked-up MTU. With standard MTU, the netcat goes 136MB/sec instead of 176MB/sec.

          Granted, this is a microbenchmark, and a bit unfair since the DN uses sendfile and I'm not, here, but it does show there's significant overhead for localhost network connections.

          Todd Lipcon added a comment -

          Spent some more time on the implementation this weekend. Here's a benchmark including checksums:

          todd@todd-laptop:~/git/hadoop-common/build/hadoop-core-0.21.0-dev$ time ./bin/hadoop fs -Ddfs.client.use.unix.sockets=false -cat bigfile bigfile bigfile > /dev/null

          real 0m13.502s
          user 0m9.561s
          sys 0m2.904s

          todd@todd-laptop:~/git/hadoop-common/build/hadoop-core-0.21.0-dev$ time ./bin/hadoop fs -Ddfs.client.use.unix.sockets=true -cat bigfile bigfile bigfile > /dev/null
          real 0m9.644s
          user 0m8.321s
          sys 0m1.012s

          bigfile is a 700MB file that I put on my local pseudo-distributed cluster.

          For comparison, here's just catting the same file:
          todd@todd-laptop:~/git/hadoop-common/build/hadoop-core-0.21.0-dev$ time ./bin/hadoop fs -cat file:///var/www/ubuntu-8.10-desktop-amd64.iso file:///var/www/ubuntu-8.10-desktop-amd64.iso file:///var/www/ubuntu-8.10-desktop-amd64.iso > /dev/null
          real 0m2.914s
          user 0m1.760s
          sys 0m1.068s

          So, the result is about a 30% speedup over the current implementation, but still 3x overhead compared to local filesystem. Profiling shows that most of this is in the copying out of the direct FileChannels - I think a little bit of smart buffering there to create larger reads over the native-code boundary will get us closer to 2x overhead.

          Will clean up the patch tomorrow and upload it (though it still needs plenty of work to be committable).

          dhruba borthakur added a comment -

          Hi Todd, This is nice work. The design is somewhat complex, especially because you make the DN connect back to the client. I understand your motivation for doing this, but if this works out well, won't the design be better if we make the DN listen in on a UNIX-domain socket as well?

          You have mentioned that the patch improves performance by 30% but it is still 300% slower than the native Java File read path. Do you have any insights into why this 300% slowdown is occurring? Is it because of some buffer manipulations in the DFS client read path?

          Todd Lipcon added a comment -

          Hey Dhruba,

          The connect-back is definitely still up for discussion. I think it's good from a security standpoint to verify that the client is speaking to the datanode and not an imposter. This is definitely the simplest part of the code, though, so we can easily change it if people disagree with me.

          I'm still trying to figure out the reason for the overhead. So far, my thoughts are:

          1. Checksumming (I was comparing to RawLocalFileSystem, not ChecksumFileSystem). This is better in 0.21 with the new PureJavaCrc32, but still accounts for some overhead
          2. In the above measurements I'm using FileChannel.map to get MappedByteBuffers for the block and metadata files, then using .get() to do copies into the provided arrays. Profiling shows most of the time in java.nio.Bits.copyToByteArray. Right now all transfers from these mapped buffers are checksum-sized (512 bytes by default) and there appears to be a lot of overhead there. Next order of business, performance wise, is to see if introducing a 64KB byte[] buffer will improve things somewhat. This does not apply to BlockSender, though, since that already forms packets of (I think) 10 checksum chunks at a time.
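
          As a rough illustration of the buffering idea in item 2 (a hypothetical helper for this sketch, not code from the patch): serve the small checksum-sized reads out of a 64KB heap buffer that is refilled from the mapped block file in large copies, so the expensive copy out of the mapping happens far less often.

          import java.nio.MappedByteBuffer;

          // Sketch: refill a 64KB staging buffer from the mapped block file and
          // serve the (typically 512-byte) checksum-chunk reads out of it.
          class BufferedMappedReader {
            private static final int STAGE_SIZE = 64 * 1024;
            private final MappedByteBuffer blockBuf;
            private final byte[] stage = new byte[STAGE_SIZE];
            private int stagePos = 0;
            private int stageLen = 0;

            BufferedMappedReader(MappedByteBuffer blockBuf) {
              this.blockBuf = blockBuf;
            }

            // Read up to len bytes into dst, refilling the staging buffer as needed.
            int read(byte[] dst, int off, int len) {
              if (stagePos == stageLen) {              // staging buffer exhausted
                stageLen = Math.min(STAGE_SIZE, blockBuf.remaining());
                if (stageLen == 0) {
                  return -1;                           // end of the mapped block
                }
                blockBuf.get(stage, 0, stageLen);      // one large copy out of the mapping
                stagePos = 0;
              }
              int n = Math.min(len, stageLen - stagePos);
              System.arraycopy(stage, stagePos, dst, off, n);
              stagePos += n;
              return n;
            }
          }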

          More theories are of course welcome. http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly is an interesting resource on this topic as well.

          Todd Lipcon added a comment -

          Was planning on uploading a patch today, but it's fallen out of date due to changes in DFSClient related to append, etc. Should get to uploading a rebased patch later this week.

          Todd Lipcon added a comment -

          Here's a patch that implements the design as detailed above. More benchmarks and discussion of remaining work to come.

          This depends on the core jars built from HADOOP-6311, plus running ant like:

          ant -Djava.library.path=/home/todd/git/hadoop-common/build/native/Linux-amd64-64/lib/ run-dfsclient-test

          Raghu Angadi added a comment -

          great to see progress on this.

          • "This does not apply to BlockSender, though, since that already forms packets of (I think) 10 checksum chunks at a time."

            • It is an issue for BlockSender as well, since the limitation is imposed by FSInputChecker: readChunk() only gives the implementation access to 512 bytes of the user buffer. Filed HADOOP-3205 quite some time back; it would let BlockReader avoid a copy as well.
          • isLoopbackAddress()?
            • Is this a temporary hack? The client sees a non-loopback address even if the datanode is local.
            • In the current implementation you could first connect and then check (socket.localAddr == socket.remoteAddr) to decide whether to go with the local read (see the sketch after this list).
          • listening on "dfsclinet_clientname"
            • What happens with multiple client reads?
            • we could use "dfsclinet_some_rand_str_with_blockid".
            • Instead, making the datanode listen might have more advantages (like the random read latency mentioned below).
          • random reads
            • Since the random read bottleneck is connection latency and disk seeks, why do you think this improves random read performance? This implementation has all the latency overhead of before (ignoring the latency of connecting to a local unix socket, which might be negligible compared to a tcp connection).
            • If we make the datanode listen on a unix domain socket, we could have a real latency improvement.
            • The extra thread is more than offset by the threads avoided for reading local data.
            • Does a typical HBase installation predominantly do local reads?
          • I suspect reporting bytes read by clients might be an important issue to fix properly.
            • It could be an RPC to the DN at the end of a proper termination.
            • Or, if the datanode is listening on a unix domain socket, it could be sent over that socket.
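
          A minimal sketch of the connect-then-compare idea from the list above (illustrative only; the class and method names are not from the patch, and a real client would keep the connection open rather than closing it):

          import java.io.IOException;
          import java.net.InetSocketAddress;
          import java.net.Socket;

          class LocalPeerCheck {
            // After connecting to the datanode, if the socket's local and remote
            // addresses match, the datanode is on this host and the fast
            // local-read path can be attempted.
            static boolean isLocalPeer(InetSocketAddress dnAddr) throws IOException {
              Socket s = new Socket();
              try {
                s.connect(dnAddr, 10000 /* connect timeout, ms */);
                return s.getLocalAddress().equals(s.getInetAddress());
              } finally {
                s.close();
              }
            }
          }
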
          Todd Lipcon added a comment -

          Filed HADOOP-3205 quite sometime back. It would let BlockReader avoid a copy as well.

          Cool, I'll take a look at that. I agree that it should help performance.

          Is this a temporary hack? Client sees non-loopback address even if the datanode is local.

          Yep - see the "work remaining" above. This was good enough for testing on my pseudo-distributed cluster, but we'll need something fancier for the real deal. I think your trick about connecting and then looking at the addresses may be just right - very clever!

          we could use "dfsclinet_some_rand_str_with_blockid".

          Yea, I thought I had a TODO in there. If not, I meant to. We should have some kind of random string (or an auto-incrementing static field in DFSClient). Making the datanode listen still seems like it's susceptible to some kind of impostor attack.

          since random read bottleneck is connection latency and disk seeks, why do you think this improves random read performance?

          It currently doesn't. I anticipate adding some kind of call like "boolean canSeekToBlockOffset(long pos)" to the BlockReader interface. In the case of LocalBlockReader, it can seek "for free" within the already-open file descriptor with zero latency beyond what the IO itself might cost. I was planning on adding that in another JIRA, but could certainly add it here too. For the existing BlockReader, we can return true for the case when the target position is within the TCP window – this is an optimization currently in DFSInputStream that should move into BlockReader.
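
          To make that idea concrete, a hypothetical shape for such a method (the name comes from the comment above; nothing here is committed API):

          interface SeekableBlockReader {
            // True if this reader can reposition to 'pos' (an offset within the
            // block) without tearing down its underlying connection or fd.
            boolean canSeekToBlockOffset(long pos);

            // Reposition the reader; only valid when canSeekToBlockOffset(pos) is true.
            void seekToBlockOffset(long pos) throws java.io.IOException;
          }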

          does a typical HBase installation do predominantly do local reads?

          I'm not sure of this - I think some of the HBase guys are watching this ticket. It seems to me, though, that it shouldn't be hard to convince HBase region servers to match up with local blocks.

          I suspect reporting bytes read by clients might be important issue to fix properly.

          +1. The improper termination is probably impossible to deal with, but an honesty-based policy would work, as long as we figure that clients have nothing to gain by lying.

          Andrew Purtell added a comment -

          does a typical HBase installation do predominantly do local reads?

          I'm not sure of this - I think some of the HBase guys are watching this ticket. It seems to me, though, that it shouldn't be hard to convince HBase region servers to match up with local blocks.

          To match up with local blocks, at a minimum our Master would need to learn the block locations of all store files for a region through some API, and then weight the deployment choices for regions according to the target region server being considered. But this would only go so far, as it is also important for region load to be level. In my opinion it's probably not worth the effort to worry about block placement during the initial region deployment.

          After a compaction there would be good block locality for subsequent reads because the reader would be getting back blocks it had written. HBase does both minor and major compaction. A minor compaction consolidates some flush files. Major compaction rewrites an entire region into a single store file. For a mixed read/write installation this means that gradually blocks are brought local to the region servers as this rewriting happens. One could make major compaction run more frequently (8 hours, etc.) to guarantee that data blocks are brought local to their respective region servers after this period.

          Owen O'Malley added a comment -

          A requirement on this work is that it must be possible to disable this feature if the cluster admin doesn't want it.

          Of course, it must also be disabled if the native library isn't available in either the client or datanode.

          How portable is passing fds through unix domain sockets? I assume it does not work on cygwin. What about Solaris, Mac OS, and BSD?

          Owen O'Malley added a comment -

          I'd also request that the benchmark include gridmix rather than just a straight throughput test.

          Todd Lipcon added a comment -

          A requirement on this work it must be possible to disable this feature, if the cluster admin doesn't want it.

          In this patch there's a configuration option, dfs.client.use.unix.sockets, which can disable the feature. If the native code doesn't load, the feature is also disabled. On a per-DFSClient basis, it also keeps a boolean which gets set to false if the fast path throws any exception.
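
          For reference, a small sketch of how a client could force the fast path off programmatically, assuming the dfs.client.use.unix.sockets key mentioned in this comment (the key comes from the patch under discussion, not from released code):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class DisableFastPathExample {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Force the fast path off for this client; the native-library check and
              // the per-DFSClient fallback boolean still apply independently.
              conf.setBoolean("dfs.client.use.unix.sockets", false);
              FileSystem fs = FileSystem.get(conf);
              FSDataInputStream in = fs.open(new Path(args[0]));
              in.close();
              fs.close();
            }
          }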

          How portable is passing fds through unix domain sockets? I assume it does not work on cygwin. What about Solaris, Mac OS, and BSD?

          I know it's supported on Solaris, OSX, and BSD, though we may need to change the native code slightly to make sure it's portable (I think the different platforms may use slightly different names for the same thing). This will be part of the test plan. As for Windows, I know that such capability exists with some Windows APIs, but I doubt it works in Cygwin. Since this is an optional fast path, and I don't know of anyone deploying real clusters on Cygwin, I don't think it's a problem to be unsupported.

          Todd Lipcon added a comment -

          I'd also request that the benchmark include gridmix rather than just a straight throughput test.

          +1. I'll be working on a thorough test plan this week and make sure it includes full-stack benchmarks as well as microbenchmark throughput tests. Thanks for the feedback.

          Raghu Angadi added a comment -

          Making datanode listen still seems like it's susceptible to some kind of impostor attack.

          Hey Todd,

          could you expand on why the threat increases?

          My guess is that we might be worried about a case where an impostor is listening on the unix name "DNx_predictable_name" even though DNx is a remote machine. This might cause the client to incorrectly contact the local impostor. If this is the case, it could be avoided the same way the client handles a tcp connect: the address is part of the DN info that the NN sends to the client.

          dhruba borthakur added a comment -

          > So, the result is about a 30% speedup over the current implementation, but still 3x overhead compared to local filesystem.

          I would very much like to know where the 3x performance degradation really is, and whether it is easier to solve that problem. If we can find where the 3x degradation is originating from, then a corresponding patch has the potential to make the read path 300% faster (rather than the 30% that this current patch offers). I agree that in the short term 30% is better than nothing, but the architecture of this patch is somewhat complex. It requires the DN to connect back to the client, it requires JNI and probably does not work on all platforms, it requires a patched external package "android", etc.

          Todd, is there a way for you to measure why the 3x degradation occurs?

          Todd Lipcon added a comment -

          Took some time to rebase this work against trunk (with HADOOP-5205 and HDFS-755 patched in as well). Here's a graph (and the data that made it) comparing the following:

          • checksumfs.tsv - reading a file:/// URL with an associated checksum file on my local disk
          • raw.tsv - reading the same file, but with no checksum file
          • without.tsv - pseudo-distributed HDFS with dfs.client.use.unix.sockets=false
          • with.tsv - same HDFS, but with dfs.client.use.unix.sockets=true

          For all of these tests, I used a 691MB file, and double checked md5sum output to make sure they were all reading it correctly. Each box plot shows the distribution of 50 trials of fs -cat /path/to/file. io.file.buffer.size was set to 64K for all trials.

          The big surprise here is that somehow HDFS with this patch came out faster than ChecksumFileSystem. The sys time for the two doesn't show any difference, but HDFS is using less CPU time. Since this doesn't make much sense, I reran both the HDFS and ChecksumFs benchmarks a second time and the results were the same. If anyone cares to wager a guess about how this could be possible, I'd appreciate it. Otherwise, I will try to dig into this.

          The inclusion of raw shows the same 200-300% difference referenced in earlier comments in this jira. There's no optimization we can make here aside from speeding up checksumming. The HADOOP-5205/HDFS-755 patches improved this a bit, but it's still the major difference. As noted above, this patch makes reading from the local DN perform at least as well as reading from a local checksummed system (if not inexplicably better).

          Todd Lipcon added a comment -

          Oops, I typoed above - not HADOOP-5205, but HADOOP-3205.

          Todd Lipcon added a comment -

          Mystery solved! ChecksumFileSystem is resulting in extra fstat and lseek syscalls that LocalBlockReader isn't.

          strace on the HDFS benchmark shows read/write interleaved for the duration of the block:

          2842  read(51, "0[\346\256\222\331\311\177]\2455x\16\23\33\312\211\4\rw]x\334YPs#\314\242R7\5"..., 65536) = 65536
          2842  write(1, "0[\346\256\222\331\311\177]\2455x\16\23\33\312\211\4\rw]x\334YPs#\314\242R7\5"..., 65536) = 65536
          

          strace on ChecksumFileSystem shows the same, but after every 512KB it does:

          2909  fstat(49, {st_mode=S_IFREG|0644, st_size=5659016, ...}) = 0
          2909  lseek(49, 0, SEEK_CUR)            = 5636096
          2909  lseek(49, 0, SEEK_END)            = 5659016
          2909  lseek(49, 5636096, SEEK_SET)      = 5636096
          

          Will see if I can figure out how to fix this in another JIRA.

          dhruba borthakur added a comment -

          This is turning out to be an interesting one!

          Michael Feiman added a comment -

          It would be nice to have the fast local read implemented directly in a C++ libhdfs patch (not via JNI for the Java interface); then the FUSE integration could benefit from the "native" speedup (without JNI slowness), and for a mounted FS the local read speed is especially important. Security can be simplified since the FUSE client to HDFS mostly runs as root.

          Vinod Kumar Vavilapalli added a comment -

          I'm not completely sure, but it looks like this will be affected by HDFS-997.

          Todd Lipcon added a comment -

          It actually shouldn't. The permission check happens inside the DN process, and then the fd (not the path name) is transferred over the pipe directly to the other process. I tested this mechanism between processes run by different users, and the permission check only happens at open() time for the opening process.

          That said, I've put this aside for now to work on some more pressing issues - perhaps will try to pick it back up some time this spring.
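
          As a rough sketch of the datanode-side flow described in this comment - validate the block token, open the block file read-only inside the DN, and hand only the descriptor to the client - with a hypothetical UnixDomainSocket helper standing in for the patch's actual JNI code:

          import java.io.FileDescriptor;
          import java.io.FileInputStream;
          import java.io.IOException;

          class FdPassingSketch {
            // Assumed wrapper over sendmsg() with SCM_RIGHTS; not a real Hadoop class.
            interface UnixDomainSocket {
              void sendFileDescriptor(FileDescriptor fd) throws IOException;
            }

            void serveLocalRead(UnixDomainSocket clientSock, String blockFilePath,
                                boolean tokenIsValid) throws IOException {
              if (!tokenIsValid) {
                throw new IOException("block access token rejected");
              }
              // Open read-only inside the DN process; the client receives only the fd,
              // never the path, and cannot upgrade the fd to read-write.
              FileInputStream blockIn = new FileInputStream(blockFilePath);
              try {
                clientSock.sendFileDescriptor(blockIn.getFD());
              } finally {
                blockIn.close();  // the client's duplicated fd stays open independently
              }
            }
          }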

          Sanjay Radia added a comment -

          Dhruba,
          you had mentioned that you have a prototype for this local optimization. Could you please share the performance improvement observed and your approach?

          dhruba borthakur added a comment -

          This is a patch that I have done to the 0.20 branch to improve hbase performance.

          HBase caches the read blocks from HDFS in a block cache. If the block is local, it is also cached by the local filesystem in the OS cache. I am internally referring to this patch as the "Killer of the HBase block cache". Our experiments are still continuing, but this patch (and a bunch of other HBase patches) will probably mean that HBase does not need to cache local blocks in the block cache anymore! That is a huge, huge gain!

          Todd Lipcon added a comment -

          Hey Dhruba. Your patch looks like the same design as the original one experimented with on this JIRA, which doesn't take into account security, etc.

          I still think the design I outlined once-upon-a-time using fd-passing over UNIX domain sockets is the right approach. Do you disagree?

          dhruba borthakur added a comment -

          I agree with Todd's observation that if we want an application to get the same speed from a local replica of HDFS as from the local file system, then we need to somehow shortcircuit the read to go directly to the local filesystem by bypassing the datanode completely. UNIX domain sockets would work well.

          stack added a comment -

          "Killer of the HBase block cache"

          Sounds excellent. Did your patch improve latency, Dhruba? How was CPU usage?

          @Todd Going the unix domain sockets high road, it seemed like performance (and CPU) was no different from going the old DN low road. Has the mystery as to why been solved yet?

          dhruba borthakur added a comment -

          The datanode was earlier using 2-3 complete CPUs (>200% CPU usage). After this patch, the CPU consumption by the datanode dropped to 20%.

          stack added a comment -

          The datanode was earlier using 2 - 3 complete CPUs...

          That's radical.

          dhruba borthakur added a comment -

          and the fsreadlatency dropped from 40 ms (via the datanode) to a few microseconds when going straight to the local ext3 filesystem.

          George Porter added a comment -

          This is great news. I'm so glad to see these performance improvements. One question about the unix domain socket approach to security. Is support for Windows in Hadoop organized such that implementations of components like HDFS can have OS-specific optimizations? In other words, would there be any pushback for tying this optimization to posix/unix platforms?

          Allen Wittenauer added a comment -

          This is a performance enhancement and not a core feature, which makes it OK in my book to ignore the platform dependencies. My big concern is how this plays into security. If the UNIX domain socket method can be built in such a way that local processes can't bypass HDFS authorization, then we're good.

          (Never mind the fact that there appear to be only a handful of people who really care - lip service doesn't count - about anything beyond the Linux universe...)

          Todd Lipcon added a comment -

          If the UNIX domain socket method can be built in such a way that local processes can't bypass HDFS authorization then we're good

          Yes, it can be built that way - my original patch on this issue did check block tokens before exposing the fds to the client.

          In other words, would there be any pushback for tying this optimization to posix/unix platforms

          Domain sockets do exist in some similar form on Windows, I believe. It would probably need slightly different native code. As Allen says, though, as an optimization I don't think it's problematic to provide it only for Linux. If someone spends the time to write a similar function against Windows APIs, I don't have any problem with checking it in.

          ryan rawson added a comment -

          dhruba, I am not seeing the file src/hdfs/org/apache/hadoop/hdfs/metrics/DFSClientMetrics.java in branch-20-append (nor cdh3b2). I also got a number of rejects, here are some highlights:

          ClientDatanodeProtocol, your variant has copyBlock, ours does not (hence the rej).
          Misc field differences in DFSClient, including the metrics object

          After resolving them I was able to get it up and going.

          I'm not able to get the unit test to pass, I'm guessing it's this:
          2011-02-09 14:35:49,926 DEBUG hdfs.DFSClient (DFSClient.java:fetchBlockByteRange(1927)) - fetchBlockByteRange shortCircuitLocalReads true localhst h132.sfo.stumble.net/10.10.1.132 targetAddr /127.0.0.1:62665

          Since we don't recognize that we are 'local', we do the normal read path which is failing. Any tips?

          ryan rawson added a comment -

          Applying this patch to branch-20-append and the unit test passes. Still trying to figure out why it works on one thing and not on the other. The patch is pretty dang simple too.

          ryan rawson added a comment -

          OK, this was my bad - I applied the patch wrong. The unit test passes. I'll attach a patch for others.

          ryan rawson added a comment -

          applies to head of branch-20-append

          dhruba borthakur added a comment -

          Thanks Ryan for merging that patch to the head of 0.20-append branch. Please do let me know if you see any problems with it.

          I agree with Allen/Todd that since Todd's patch is an optimization, we can get it committed even if this optimization does not work on non-linux platforms. Can some security guru review the security aspects of it?

          Allen Wittenauer added a comment -

          Short Story:

          I need to look at the patch, but keep in mind this should work on pretty much every modern UNIX and UNIX-like system. Heck, even less-than-mainstream-these-days OSes like Tru64 support these interfaces. So we should make sure that this optimization isn't restricted to working only on Linux machines.

          Long Story:

          If I remember my history correctly, UNIX domain sockets were originally SVID and "bumped up" to be a POSIX standard later on. So any System V-compliant OS should work, thus covering HP-UX and Solaris. The BSD and Linux kernels are covered when one compiles with the STREAMS interfaces enabled, IIRC. (In other words, someone using Gentoo or a hand-built FreeBSD install might need to recompile their kernel.) Apple does appear to compile in STREAMS for at least 10.6. I think AIX falls under the System V-based family (even though I worked on it for years, I'm still not sure of its origin.)

          I'm not sure if cygwin or Windows natively support STREAMS. So that should be the only OS this change may not work on.

          Owen O'Malley added a comment -

          Allen, the question is more whether passing fds down the unix domain socket works on those platforms. It seems fine to have platform specific optimizations as long as they are handled cleanly.

          I haven't looked at the current patch, but from a high level it should be fine as long as the block access token is validated before the fd is handed out. One thing that absolutely needs to be checked is whether fdopen (or any other code path that lets you modify the read/write access on a fd) will work on the passed across fd. If so, it is a show stopper since HDFS can't handle users modifying block replicas.

          From a technical point of view, the client will also need to handle read failures on the fd and go back to the Data Node to continue reading, since the block may be migrated to a different machine.

          What is the percentage of local reads in HBase? In MapReduce jobs, the time spent reading from the local machine is a small percentage of the total job time and the complexity would be better spent making the remote read and write faster.

          Jonathan Gray added a comment -

          A very large percentage of reads are local in HBase. Background compactions ensure there are local replicas, and periodic major compactions bring all data local for all regions/servers. Early versions of this patch have already significantly helped performance for us.

          stack added a comment -

          @Owen Remote reads are the exception.

          Todd Lipcon added a comment -

          One thing that absolutely needs to be checked is whether fdopen (or any other code path that lets you modify the read/write access on a fd) will work on the passed across fd. If so, it is a show stopper since HDFS can't handle users modifying block replicas

          I believe I checked this last year when I was first working on this issue, and found that there was no way to escalate a read-only fd to a read-write fd. According to the fdopen() manpage:

          The mode of the stream (one of the values "r", "r+",
          "w", "w+", "a", "a+") must be compatible with the mode of the file
          descriptor.

          i.e. it just creates a FILE * wrapper around the existing fd.

          From a technical point of view, the client will also need to handle read failures on the fd and go back to the Data Node to continue reading, since the block may be migrated to a different machine

          Yes, iirc this was also maintained in the patch. Of course the patch is long out of date now, but I agree wholeheartedly.

          dhruba borthakur added a comment -

          Almost all reads (> 99%) in HBase are node-local.

          ryan rawson added a comment -

          In a test with 15 threads of HBase clients, latency goes from 12.1 ms -> 6.9 ms with this patch. Based on my report to the user@hbase list, there are a few people who are pulling down my patched hadoop variant and want to test and run with it. Based on the iceberg theory of interest, this is one of the hottest things I've seen people want - and want NOW - in a while.

          Nathan Roberts added a comment -

          We have been looking closely at the capability introduced in this Jira because the initial results look very promising. However, after looking deeper, I’m not convinced this is an approach that makes the most sense at this time. This Jira is all about getting the maximum performance when the blocks of a file are on the local node. Obviously performance of this use case is a critical piece of “move computation to data”. However, if going through the datanode were to offer the same level of performance as going direct at the files, then this Jira wouldn’t even exist. So, I think it’s really important for us to understand the performance benefits of going direct and the real root causes of any performance differences between going direct and having the data flow through the datanode. Once that is well understood, then I think we could look at the value proposition of this change. We’ve tried to do some of this analysis and the results follow. Key word here is “some”. I feel we’ve gathered enough data to draw some valuable conclusions, but I don’t think it’s enough data to say this type of approach wouldn’t be worth pursuing down the road.

          For the impatient, the paragraphs below can be summarized with the following points:
          + Going through the datanode maintains architectural layering. All other things being equal, it would be best to avoid exposing the internal details of how the datanode maintains its data. Violations of this layering could paint us into a corner down the road and therefore should be avoided.
          + Benchmarked localhost sockets at 650MB/sec (write->read) and 1.6GB/sec (sendfile->read). nc uses 1K buffers and this probably explains the low bandwidth observed as part of this jira.
          + Measured maximum client ingest rate at 280MB/sec for sockets. Checksum calculation seems to play a big part of this limit.
          + Measured maximum datanode streaming output rate of 827MB/sec.
          + Measured maximum datanode random read output rate of 221MB/sec (with hdfs-941).
          + The maximum client ingest rate of 280MB/sec is significantly slower than the maximum datanode streaming output rate of 827MB/sec and only marginally faster than the maximum datanode random output rate of 221MB/sec. This seems to say that with the current bottlenecks, there isn’t a ton of performance to be gained from going direct, at least not for the simple test cases used here.

          For the detail oriented, keep reading.

          If everything were optimized in the system then going direct is certainly going to have a performance advantage (fewer layers mean higher top-end performance). However, the questions are:
          + How much of a performance gain?
          + Can this gain be realized with existing use cases?
          + Is the gain worth the layering violations? For example, what if we decided to automatically merge small blks into single files? In order to access this data directly, both the datanode and the client side code would have to be cognizant of this format. Or what if we wanted to support encrypted content? Or if we wanted to handle I/O errors differently than they’re handled today? I’m sure there are others I’m not thinking of.

          Ok, now for some data.

          One of the initial comments talked about overhead of localhost network connections. The comment used nc to measure bandwidth through a socket vs bandwidth through a pipe. We looked into this a little because this was a bit surprising. Sure enough on my rhel5 system, I saw pretty much the same numbers. Digging deeper, nc uses a 1K buffer in rhel5, this can’t be good for throughput. So, we ran lmbench on the same system to see what sort of results we get. localhost sockets and pipes both came in right around 660MB/sec with 64K blocksizes. Pipes will probably scale up a bit better across more cores but I would not expect to see a 5x difference as the original nc experiment showed. We also modified lmbench to use sendfile() instead of write() in the local socket test and measured this throughput to be 1.6GB/sec.

          CONCLUSION: A localhost socket should be able to move around 650MB/sec for write->read, and 1.6GB/sec for sendfile->read.

          The remaining results involve hdfs. In these tests the blks being read are all in the kernel page cache. This was done to completely remove disk seek latencies from the equation and to completely highlight any datanode overheads. io.file.buffer.size was 64K in all tests. (Todd measured a 30% improvement using the direct method with checksums enabled. I can’t completely reconcile this improvement with the results below but I’m wondering if it’s due to that test using the default of 4K buffers??? I think the results of that test would be consistent with the results below if that were the case. In any event it would be good to reconcile the differences at some point.)

          The next piece of data we wanted was the maximum rate at which the client can ingest data. The first thing we did was to run a simple streaming read. In this case we saw about 280 MB/sec. This is nowhere near 1.6GB/sec so the bottleneck must be the client, the server, or both (i.e. it’s not the pipe). The client process was at 100% CPU, so it’s probably there. To verify, we disabled checksum verification on the client and this number went up to 776MB/sec and client CPU utilization was still 100%. The bottleneck appears to still be at the client. This is most likely due to the fact that the client has to actually copy the data out of the kernel while the datanode uses sendfile.

          CONCLUSION: Maximum client ingest rate for a stream is around 280MB/sec. Datanode is capable of streaming out at least 776MB/sec. Given current client code, there would not be a significant advantage to going direct to the file because checksum calculation and other client overheads limit its ingestion rate to 285MB/sec and the datanode is easily capable of sustaining this rate for streaming reads.

          The next thing we wanted to look at was random I/O. There is a lot more overhead on the datanode for this particular use case so this could be a place where direct access could really excel. The first thing we did here was run a simple random read test to again measure the maximum read throughput. In this case we measured 105MB/sec. Again we tried to eliminate the bottlenecks. However, it’s more complicated in the random read case due to the fact that it is a request/response type of protocol. So, first we focused on the datanode. hdfs-941 is a proposed change which helps the pread use case significantly. The implementation in 941 seems very reasonable and looks to be wrapping up very soon. So, we applied the 941 patch and this improved the throughput to 143MB/sec.

          This isn’t at the 285MB/sec yet so it’s still conceivable that going direct could add a nice boost.

          Since this is a request/response protocol, the checksum processing on the client will impact the overall throughput of random I/O use cases. With checksums disabled, the random I/O throughput increased from 143MB/sec to 221MB/sec.

          CONCLUSION: A localhost socket maxes out at around 1.6GB/sec, we measured 827MB/sec for no-checksum streaming reads. The datanode is currently not capable of maxing out a localhost socket.
          CONCLUSION: Clients can currently ingest about 280MB/sec. This rate is easily reached with streaming reads. For random reads, with HDFS-941, this rate is a bit faster (280MB/sec vs 221MB/sec) but not dramatically so. Therefore, for today the right approach seems to be to enhance the datanode to make sure the bottleneck is squarely at the client. Since the bottleneck is mainly due to checksum calculation and data copies out of the kernel, going direct to a blk file shouldn’t have a significant impact because both of these overhead activities need to be performed whether going direct or not.

          The results above are all in terms of single reader throughput of cached blk files. More scalability testing needs to be performed. We did verify that on a dual-quad core system that the datanode could scale its random read throughput from 137MB/sec to 480MB/sec with 4 readers. This was enough load to saturate 5 of the 8 cores with clients consuming 3 and datanodes consuming 2. It’s just one data point, there’s lots more work to be done in the area of datanode scalability.

          Latency is also a critical attribute of the datanode and some more data needs to be gathered in this area. However, I propose we focus on fixing any contention/latency issues within the datanode prior to diving into a direct I/O sort of approach (and there are already a few jiras out there that are in the area of improving concurrency within the datanode). If we can’t get anywhere near the latency requirements, then at that point we should consider more efficient ways of getting at the data.

          Thanks to Kihwal Lee and Dave Thompson for doing a significant amount of data gathering! Gathering this type of data always seems to take longer than one would think, so thank you for the efforts.
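
          To make the checksum point concrete, here is a rough, self-contained Java micro-benchmark sketch that compares a plain buffer copy against the same copy with per-chunk CRC32 computation. The 512-byte chunk size mirrors HDFS's default bytes-per-checksum; the absolute numbers depend entirely on the machine and on JIT warm-up, so treat it as an illustration of where the client CPU goes, not as a rigorous benchmark.

          import java.util.zip.CRC32;

          public class ChecksumCostSketch {
            public static void main(String[] args) {
              byte[] data = new byte[64 * 1024 * 1024];   // stand-in for one 64 MB block
              byte[] copy = new byte[data.length];

              long t0 = System.nanoTime();
              System.arraycopy(data, 0, copy, 0, data.length);   // copy only
              long t1 = System.nanoTime();

              CRC32 crc = new CRC32();
              for (int off = 0; off < data.length; off += 512) { // copy + checksum per 512 B chunk
                int len = Math.min(512, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                System.arraycopy(data, off, copy, off, len);
              }
              long t2 = System.nanoTime();

              System.out.printf("copy only:       %.1f ms%n", (t1 - t0) / 1e6);
              System.out.printf("copy + checksum: %.1f ms%n", (t2 - t1) / 1e6);
            }
          }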

          Nathan Roberts added a comment -

          With the work on HDFS-2080, I'd be really curious to see a benchmark with/without HDFS-347. With some of the other bottlenecks (HDFS-941 and HDFS-2080) out of the way, we'd be close to an apples-to-apples comparison.

          dhruba borthakur added a comment -

          My observation has been that it is the high CPU usage on the datanodes that was causing performance degradation while doing random reads from HDFS (local block). I have 400 threads in hbase that are doing random reads from a bunch of files in HDFS.

          Nathan Roberts added a comment -

          Do you run with HDFS-941? How many random reads per second are you hitting HDFS with? IIRC we see close to 10K 64KB reads per second with ~2 cores.

          dhruba borthakur added a comment -

          we do not run with HDFS-941. I will post numbers once I get that incorporated into our production environment.

          Doug Meil added a comment -

          Can this ticket be closed out now, since this "local read" feature has been implemented in HDFS-2246?

          Eli Collins added a comment -

          No, see the 1st comment of HDFS-2246:

          HDFS-347 discusses ways to optimize reads for local clients. A clean design is fairly involved. A shortcut has been proposed where the client access the hdfs file blocks directly; this works if the client is the same user/group as the DN daemon. This is non-invasive and is a good short term solution till HDFS-347 is completed.

          In short there are some limitations with HDFS-2246 that we should address (e.g. it only works with a single user).

          Colin Patrick McCabe added a comment -

          Testable patch including HDFS-3753, HADOOP-6311 and HDFS-347.

          Colin Patrick McCabe added a comment -

          This patch only includes HDFS-347.

          • DataChecksum#newDataChecksum: correctly handle offset values other than 0.
          • BlockReader / BlockReaderUtil: add skipFully and available methods. Add JavaDoc for skip method. The available method returns a rough approximation of how much data might be available without doing any more network I/O. This helps us optimize in the case where we are reading from a local file descriptor, since we never do network I/O in that case.
          • BlockReaderLocal: simpler implementation that uses raw FileChannel objects. We don't need to cache anything, or make RPCs to the DataNode.
          • DFSClient / DFSInputStream: update getLocalBlockReader to work with fd passing. Rather than overloading AccessControlException to mean "local reads were not enabled," create a new exception called LocalReadsDisabledException and throw it when that is the case. This will prevent confusion in the future. Use skipFully instead of skip, since the latter may give us short skips (a rough sketch of such a skipFully helper follows this list).
          • DFSConfigKeys: don't need dfs.block.local-path-access.user any more. Local reads are now on by default rather than disabled by default.
          • RPC stuff: add BlockLocalFdInfo. Deprecate BlockLocalPathInfo. Implement the DataNode, FsDatasetImpl, etc. methods. Add GetBlockLocalFdInfoResponseProto. The old RPC is now deprecated and will always throw an AccessControlException, so that older clients will fall back to remote reads.
          • MiniDFSCluster: add getBlockMetadataFile which is like getBlockFile except that it returns .meta files.
          • Tests: TestBlockReaderLocal now includes more tests of BlockReaderLocal in isolation. TestParallelRead now explicitly disables local reads (that case is tested by TestParallelLocalRead). TestShortCircuitLocalRead: add testDeprecatedGetBlockLocalPathInfoRpc to test the deprecated RPC.
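
          As a side note on the skip change above, here is a minimal, self-contained Java sketch of a skipFully helper over a plain InputStream. The real patch adds the method to the BlockReader interface; this version only illustrates the loop needed because skip() is allowed to skip fewer bytes than requested.

          import java.io.EOFException;
          import java.io.IOException;
          import java.io.InputStream;

          public final class SkipFullySketch {
            // Keep calling skip() until n bytes have been skipped; skip() may return
            // a short count, or zero, without having reached end-of-stream.
            public static void skipFully(InputStream in, long n) throws IOException {
              while (n > 0) {
                long skipped = in.skip(n);
                if (skipped <= 0) {
                  // No progress: use read() to distinguish a stall from a real EOF.
                  if (in.read() == -1) {
                    throw new EOFException("Premature EOF with " + n + " bytes left to skip");
                  }
                  skipped = 1;
                }
                n -= skipped;
              }
            }
          }
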
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547273/HDFS-347.016.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified test files.

          -1 javac. The applied patch generated 2056 javac compiler warnings (more than the trunk's current 2052 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3251//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3251//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3251//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3251//console

          This message is automatically generated.

          Andy Isaacson added a comment -
          +  iov.iov_base = (void*)&resp;
          +  iov.iov_len = 1;
          

          This should be iov_base = resp; iov_len = sizeof(resp);. Neither change is semantically important but they're both more philosophically correct: avoid unnecessary casts, the undecorated array gives its address, and len is the size of the buffer.

          +  RETRY_ON_EINTR(res, recvmsg(sock, &socketMsg, 0));
          +  if (res < 0) {
          +    ret = errno;
          +    return newNativeIOException(env, ret,
          +        "FdClient::recvFd(sockPath=%s): recv "
          

          Since we called recvmsg we should say recvmsg not recv in the exception.

          +  } else if (res == 0) {
          +    ret = errno;
          

          recvmsg does not change errno when it returns 0, so this makes for an unpredictable return value.

          ... ah, I see we don't actually use ret below here, so let's just delete it in this else case.

          More later.

          Colin Patrick McCabe added a comment -

          This version fixes some bugs in the fallback case where the JNI libraries are not installed.

          It also adds a few more junit tests.

          Colin Patrick McCabe added a comment -
          • The combined version for Jenkins to test.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547572/HDFS-347.017.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          -1 javac. The applied patch generated 2056 javac compiler warnings (more than the trunk's current 2052 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestBlockReaderLocal
          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3260//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3260//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3260//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3260//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547572/HDFS-347.017.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          -1 javac. The applied patch generated 2056 javac compiler warnings (more than the trunk's current 2052 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestBlockReaderLocal
          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3261//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3261//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3261//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3261//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          This is a bogus findbugs warning. Both FileInputStreams are unconditionally closed in FsDatasetImpl#getBlockLocalFdInfo in a finally{} block.
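
          For readers unfamiliar with the warning, the pattern being described is roughly the following (a schematic sketch with invented names, not the patch's actual code): two streams are opened so their file descriptors can be handed to the client, and both are closed unconditionally in the finally block, so there is no real leak on any path.

          import java.io.FileInputStream;
          import java.io.IOException;

          class FinallyCloseSketch {
            static void openAndClose(String blockPath, String metaPath) throws IOException {
              FileInputStream blockIn = null;
              FileInputStream metaIn = null;
              try {
                blockIn = new FileInputStream(blockPath);
                metaIn = new FileInputStream(metaPath);
                // ... hand the underlying file descriptors to the client ...
              } finally {
                // Both streams are closed here; a production version would also guard
                // against close() itself throwing before the second close runs.
                if (metaIn != null) {
                  metaIn.close();
                }
                if (blockIn != null) {
                  blockIn.close();
                }
              }
            }
          }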

          Todd Lipcon added a comment -

          Even if it's bogus, you need to either figure out a way to avoid it, or add it to the findbugs exclude file. Otherwise all future builds will report it.

          Also, looks like the new unit test failed in the latest build.

          I'll review the pieces of this once it's no longer in flux, passing its unit tests, etc.

          Colin Patrick McCabe added a comment -
          • SuppressWarnings("deprecation") seems to not work; skip it, and just don't mark BlockLocalPathInfo as @deprecated, to avoid creating a lot of warnings.
          • re-arrange the InputStream close methods in hopes of placating findbugs. This doesn't fix any bugs, but hopefully it quiets it down.
          • TestBlockReaderLocal: when reading with checksums disabled, we should not expect to detect checksum errors.
          Colin Patrick McCabe added a comment -

          New combined HADOOP-6311 + HDFS-347 patch for jenkins.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547606/HDFS-347.018.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3263//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3263//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Have you looked into the TestPipelinesFailover failure? It seems to have failed all the builds on this JIRA but hasn't been flaky in the past.

          Colin Patrick McCabe added a comment -

          I haven't looked at it in detail. Have you seen it fail this way before?

          Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test-13 could only be replicated to 0 nodes instead of minReplication (=1).  There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
          
          Philip Zeyliger added a comment -

          That error message is typically a statement that the reserved disk space requirement makes it so that the datanodes have no more space. (Or, the datanodes really don't have any more space.)

          Colin Patrick McCabe added a comment -

          I ran TestPipelinesFailover locally with and without this change and it did not fail. I'm not sure what's going on with it on Jenkins. I will re-submit in case the problem was a lack of disk space.

          Colin Patrick McCabe added a comment -

          re-submit

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547606/HDFS-347.018.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3275//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3275//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          See HDFS-3391: TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

          Colin Patrick McCabe added a comment -

          Ignore the preceding comment, that JIRA was fixed already.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12548269/HDFS-347.018.patch2
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3294//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3294//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          Andy Isaacson: This should be iov_base = resp; iov_len = sizeof(resp);.

          Thanks for the style suggestions. It would be better to put the comments about the C part of this into HADOOP-6311, since that JIRA focuses purely on the FD-passing side of things.

          I posted a new HADOOP-6311 patch with the style changes you suggested to that JIRA.

          Colin Patrick McCabe added a comment -
          • rebase on HDFS-347
          • DataNode#shutdown should close the fdServer so that all published file descriptors are closed.
          Colin Patrick McCabe added a comment -

          er, that should read "rebase on HADOOP-6311"

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12548622/HDFS-347.019.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3301//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3301//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          This is an updated patch based on a discussion we had in HADOOP-6311. Basically, the current design is to pass file descriptors over a new class named DomainSocket, which represents UNIX domain sockets. This is accomplished by adding a new message to the DataTransferProtocol, RequestShortCircuitFd.

          The DataXceiverServer can manage these UNIX domain sockets just as easily as it manages the existing IPv4 sockets, because they implement the same interfaces.

          One thing I refactored in this patch is BlockReaderFactory. It formerly contained only static methods; this patch changes it to be a "real" class with instance methods and instance data. I felt that the BlockReaderFactory methods were getting too unwieldy because we were passing a tremendous number of parameters, many of which could be considered properties of the factory in a sense. Using instance data also allows the factory to keep a blacklist of which DataNodes do not support file descriptor passing. It uses this information to avoid making unnecessary requests.

          This patch also introduces the concept of a format version number for blocks. The idea here is that if we later change the block format on-disk, we want to be able to tell clients that they can't short-circuit access these blocks unless they can understand the corresponding version number. (One change we've talked a lot about doing in the past is merging block data and metadata files.) This makes it possible to have a cluster where you have some block files in one format and some in another-- a necessity for doing a real-world transition. The clients are passed the version number, so they can act intelligently-- or simply refuse to read the newer formats if they don't know how.

          Because this patch depends on the DomainSocket code, it currently incorporates that code. HADOOP-6311 is the best place to comment about DomainSocket, since that is what that JIRA is about.
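
          To illustrate the client-side flow described above (blacklist plus fallback), here is a conceptual, self-contained Java sketch. The DomainSocketLike interface and every method name in it are stand-ins for this sketch only; they are not the DomainSocket or BlockReaderFactory API the patch actually introduces.

          import java.io.FileInputStream;
          import java.io.IOException;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          class ShortCircuitFactorySketch {
            interface DomainSocketLike {
              // Ask the local DataNode for the block and meta file descriptors of a replica.
              FileInputStream[] requestShortCircuitFds(long blockId) throws IOException;
            }

            // DataNodes that have already told us they cannot do fd passing
            // (older software, feature disabled, unknown block format version, ...).
            private final Set<String> noShortCircuit = ConcurrentHashMap.newKeySet();

            FileInputStream[] tryShortCircuit(String datanode, DomainSocketLike sock, long blockId) {
              if (noShortCircuit.contains(datanode)) {
                return null;                      // don't keep asking a node that said no
              }
              try {
                return sock.requestShortCircuitFds(blockId);
              } catch (IOException e) {
                noShortCircuit.add(datanode);     // remember, and fall back to remote reads
                return null;
              }
            }
          }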

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552211/HDFS-347.020.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 6 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.net.unix.TestDomainSocket

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3447//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3447//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3447//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3447//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -
          • rebase on HADOOP-6311 fixes
          • nicer logs when short-circuit can't be enabled
          • use TemporarySocketDirectory in TestParallelLocalRead, TestShortCircuitLocalRead
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552331/HDFS-347.021.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3456//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3456//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3456//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          fix warnings

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552901/HDFS-347.022.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3476//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3476//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The current implementation puts all the communication with the DN over the unix socket. I think it would be worth having three modes for this configuration:

          1) Disabled – code paths identical to today
          2) Enabled for FD passing only – only connects via unix socket if it's about to try to do fd-passing. Otherwise, it uses loopback TCP
          3) Enabled for FD passing and all communication

          The reason for option 1 is obvious: it's a lot of new code and we'd be wise to introduce it as "experimental" initially.
          The reason for option 2 is that, if we only use it for fd passing, we don't need to care about performance or subtle bugs in the data transfer path. The FD transfer has the nice property that it either works or doesn't work - it's much less likely that it would pass a 'corrupt' FD. Also, the unix socket path seems to be much slower than TCP in the current implementation (see more below)
          The reason for option 3 is that, according to benchmarks seen elsewhere (and 'netperf'), the unix sockets should be able to go 2x the speed of TCP loopback once we spend some time optimizing them. This would have some benefits:

          • faster performance, with no semantic difference (eg metrics and architectural layering maintained)
          • improvements on the write path as well as the read path

          If the data-over-unix-sockets path is significantly faster than the existing TCP path (I think it should be possible to get 2x), then that seems like the kind of thing we'd want on by default for every MR task, etc, since we'd get the speedup without any cost of lost metrics or QoS opportunities in the DN. I can see still wanting fd passing for applications like HBase that are heavily random-access oriented, but for streaming, I think if we can get 'close' to the optimal, the metrics are worth more than the last little bit of juice.
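
          For concreteness, the three modes map roughly onto client-side configuration keys that appear later in this thread (dfs.datanode.domain.socket.path, dfs.client.read.shortcircuit, dfs.client.domain.socket.data.traffic); a hedged sketch, since the exact key names and defaults are settled in later revisions:

            import org.apache.hadoop.conf.Configuration;

            // Hypothetical sketch of how a client could select one of the three modes.
            class ShortCircuitModesSketch {
              static Configuration forMode(int mode) {
                Configuration conf = new Configuration();
                if (mode >= 2) {
                  // Mode 2: use the domain socket only for fd passing (short-circuit reads).
                  conf.set("dfs.datanode.domain.socket.path", "/tmp/dn-sock");
                  conf.setBoolean("dfs.client.read.shortcircuit", true);
                }
                if (mode >= 3) {
                  // Mode 3: additionally route ordinary read traffic over the domain socket.
                  conf.setBoolean("dfs.client.domain.socket.data.traffic", true);
                }
                // Mode 1: leave everything at the defaults; code paths are unchanged.
                return conf;
              }
            }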

          I spent some time looking at the performance of unix sockets (data path over unix, not fd passing) in your current patch, and found that the data path is at least 2x slower in my benchmark, and uses 3x as much CPU. This seems to be due to a number of things:

          • The domain socket doesn't implement transferTo (aka sendfile). So, we end up doing a lot more copies on the sending side to go in and out of kernel space
          • "CallIntMethod" is showing up a lot in my 'perf top' output. This seems to be from within the readByteBuffer0 call. I think we can optimize this significantly as follows:
            • Assume that the Java code always passes a direct buffer into the native code. If the user supplies a non-direct buffer, use a cached 32KB (or so) direct buffer inside the InputStream to read into and then copy into the user-supplied array-based buffer. Given that our read path always uses direct buffers, this should be an easy simplification.
            • Pass the buffer's offset and remaining length in via parameters to the JNI function, rather than calling "back" into Java with CallIntMethod. This should have significantly better performance, since the JIT will take care of inlining and lock elision on the Java side (see the sketch after this list).
          • In the read() call, you're currently calling fdRef() and fdUnref() every time. Looking at the implementation of the similar pieces of the JDK, they get around this kind of overhead. It would be interesting to try 'breaking' the code to not do the ref counting on read, to see if it's a bottleneck. My guess is that it might be, since the atomic operations end up issuing a reasonably costly memory barrier, somewhat needlessly.
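
          As a sketch of the CallIntMethod point above (signatures are illustrative, not the patch's actual native interface), the Java side can hand the descriptor, position, and remaining length to the native call so it never has to call back into Java:

            import java.io.IOException;
            import java.nio.ByteBuffer;

            // Hypothetical sketch of the suggestion above, not code from the patch.
            class DomainChannelSketch {
              private final int fd;  // raw descriptor held by the domain socket

              DomainChannelSketch(int fd) { this.fd = fd; }

              // The native side reads directly into the direct buffer at
              // [position, position + remaining) and returns the byte count;
              // it never needs CallIntMethod back into Java.
              private static native int readByteBuffer0(int fd, ByteBuffer directBuf,
                                                        int position, int remaining) throws IOException;

              int read(ByteBuffer dst) throws IOException {
                // Assumes dst is a direct buffer (the HDFS read path already uses them);
                // a heap buffer would first be staged through a cached direct buffer.
                int n = readByteBuffer0(fd, dst, dst.position(), dst.remaining());
                if (n > 0) {
                  dst.position(dst.position() + n);  // the native code does not advance the position
                }
                return n;
              }
            }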

          Overall, I'd try to model the code a little closer to the built-in JDK implementations of SocketChannel, etc.

          All of the above only matters if the data path is going over the unix sockets (option 3 above). Hence the suggestion that we could do a more minimal initial implementation without offering option 3, or at least not recommending option 3, and then work to do the optimization for the data path separately.

          Regarding test plan, have you thought about how we can verify this? It's a lot of new code if we assume that the data path may run over it. I'm particularly concerned about things like timeout handling or races on socket close which could lock up a datanode or cause an FD leak. Explaining a test plan that covers things like this would be helpful. (One of the original reasons that I liked importing the Android code was that it's likely to have been well tested, whereas this patch has nearly the same amount of new code, except that it hasn't been baked anywhere yet).

          I have some comments on the code itself, but I want to take a few more passes through it to understand it all better before I post - no sense nitpicking small things when there are bigger questions per the above.

          Allen Wittenauer added a comment -

          BTW, is this still insecure or has that been fixed?

          Todd Lipcon added a comment -

          This is no longer insecure - it uses file descriptor passing over a unix socket so that the DN is the one arbitrating all access.

          I implemented a couple of the optimizations mentioned above (avoiding CallIntMethod and adding sendfile() support) and now the unix data path is a little bit faster than the TCP path:

          over unix sockets:
          todd@todd-w510:~/git/hadoop-common/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT$ time ./bin/hadoop fs -Ddfs.datanode.domain.socket.path=/tmp/dn-sock  -cat $(for x in $(seq 1 20) ; do echo /user/todd/1GB ; done) | wc -c
          
          datanode utime: 2.02
          datanode stime: 11.22
          real    0m24.137s
          user    0m12.530s
          sys     0m16.270s
          
          over TCP:
          todd@todd-w510:~/git/hadoop-common/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT$ time ./bin/hadoop fs   -cat $(for x in $(seq 1 20) ; do echo /user/todd/1GB ; done) | wc -c
          20971520000
          datanode utime: 5.47
          datanode stime: 6.52
          real    0m26.473s
          user    0m12.750s
          sys     0m21.010s
          

          The results above are a bit strange in that the system time on the DN is better for TCP than for local sockets. I'm guessing a little investigation there will make it clearer - perhaps a similar improvement to the writeBuffer code would yield a speedup.

          Something seems to be wrong with the fd-passing (short-circuit) path in this patch, though. When I enabled it, I could tell from jstacks that it was "working" but I got really slow performance:

          real    1m5.366s
          user    0m35.710s
          sys     0m37.700s
          

          I couldn't understand from the code why BlockReaderLocal is substantially rewritten. I'd think it would be pretty much identical after the point where you get the files open. I'm guessing the rewrite is what killed performance here.

          Todd Lipcon added a comment -

          Oops, forgot the link to my improved code:
          https://github.com/toddlipcon/hadoop-common/tree/hdfs-347-colin

          (has some hackiness in places, was just doing it to test out the performance of these fixes)

          Colin Patrick McCabe added a comment -

          I will see if I can reuse the old BlockReaderLocal code. There was a bunch of stuff in it that was no longer relevant, but most of it can probably be reused.

          I'll add a way to enable fd passing, but not data-over-unix-sockets.

          Hence the suggestion that we could do a more minimal initial implementation without offering [data-over-unix-sockets], or at least not recommending [it], and then work to do the optimization for the data path separately.

          I agree. That kind of work would best be done in follow-up JIRAs.

          Colin Patrick McCabe added a comment -
          • Use the previous BlockReaderLocal code, with a few updates.
          • test some different combinations:
            UNIX domain sockets
            UNIX domain sockets + short circuit
            UNIX domain sockets + short circuit + skip checksum
          Colin Patrick McCabe added a comment -

          The speed issue with short-circuit reads seems to be resolved in this latest patch. I did not start optimizing domain socket throughput just yet. I feel that could take a while and is best done in another JIRA.

          With regard to the atomic operations comment, one thing that we could do is increment the refcount when creating the inputstream / outputstream and decrement it when the streams were closed. That would allow read and write from the streams without reference counting. But again, I'd rather do that in another JIRA.
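
          A minimal sketch of that idea (class and method names are hypothetical; DomainSocket's real API may differ): take the reference once when the stream is created and release it on close, so individual reads carry no atomic refcount traffic:

            import java.io.IOException;
            import java.io.InputStream;

            // Hypothetical sketch, not code from the patch: pin the descriptor for the
            // stream's lifetime instead of calling fdRef()/fdUnref() around every read.
            class RefCountedSocketInputStreamSketch extends InputStream {
              interface SocketHandle {
                void fdRef();
                void fdUnref();
                int readByte() throws IOException;  // assumes the descriptor is already pinned
              }

              private final SocketHandle sock;

              RefCountedSocketInputStreamSketch(SocketHandle sock) {
                this.sock = sock;
                sock.fdRef();                       // one reference per stream
              }

              @Override
              public int read() throws IOException {
                return sock.readByte();             // no per-call reference counting
              }

              @Override
              public void close() throws IOException {
                sock.fdUnref();                     // single release when the stream closes
              }
            }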

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553416/HDFS-347.024.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestParallelReadUtil

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3498//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3498//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          It looks like JUnit needs some prodding not to try to directly execute the methods in TestParallelReadUtil.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553528/HDFS-347.025.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancer

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3503//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3503//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          The test failure is unrelated (this change doesn't affect the balancer).

          Colin Patrick McCabe added a comment -

          Here are some benchmarks I did locally on a one-node cluster. I did these to confirm that there are no performance regressions with the new implementation.

          With HDFS-347 and dfs.client.read.shortcircuit = true and dfs.client.read.shortcircuit.skip.checksum = false:

          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.46user 3.38system 0:09.50elapsed 114%CPU (0avgtext+0avgdata 423200maxresident)k
          0inputs+104outputs (0major+25697minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.39user 3.37system 0:09.43elapsed 114%CPU (0avgtext+0avgdata 430352maxresident)k
          0inputs+144outputs (0major+24399minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.41user 3.39system 0:09.51elapsed 113%CPU (0avgtext+0avgdata 439536maxresident)k
          0inputs+144outputs (0major+25609minor)pagefaults 0swaps
          =========================================
          With unmodified trunk and dfs.client.read.shortcircuit = true and dfs.client.read.shortcircuit.skip.checksum = false, and dfs.block.local-path-access.user = cmccabe:

          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.60user 3.58system 0:09.89elapsed 113%CPU (0avgtext+0avgdata 444848maxresident)k
          0inputs+64outputs (0major+25903minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.65user 3.44system 0:09.57elapsed 115%CPU (0avgtext+0avgdata 443824maxresident)k
          0inputs+64outputs (0major+24054minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          7.50user 3.43system 0:09.42elapsed 116%CPU (0avgtext+0avgdata 422624maxresident)k
          0inputs+64outputs (0major+25918minor)pagefaults 0swaps
          =========================================
          with HDFS-347 and dfs.client.read.shortcircuit = false
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          10.15user 8.83system 0:17.88elapsed 106%CPU (0avgtext+0avgdata 412512maxresident)k
          0inputs+224outputs (0major+24449minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          10.19user 8.55system 0:17.23elapsed 108%CPU (0avgtext+0avgdata 449248maxresident)k
          0inputs+184outputs (0major+24109minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          10.24user 8.38system 0:17.16elapsed 108%CPU (0avgtext+0avgdata 439568maxresident)k
          0inputs+144outputs (0major+23957minor)pagefaults 0swaps
          =========================================
          with unmodified trunk and dfs.client.read.shortcircuit = false

          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          10.76user 8.64system 0:18.18elapsed 106%CPU (0avgtext+0avgdata 483872maxresident)k
          0inputs+64outputs (0major+28735minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          10.59user 8.54system 0:17.46elapsed 109%CPU (0avgtext+0avgdata 491216maxresident)k
          0inputs+64outputs (0major+27868minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
          9.81user 8.95system 0:17.24elapsed 108%CPU (0avgtext+0avgdata 422144maxresident)k
          0inputs+64outputs (0major+25726minor)pagefaults 0swaps

          Todd Lipcon added a comment -

          I believe the reviewboard-to-JIRA gateway is broken, so for those who would like to follow along, there is a review-board post of Colin's latest patch that I posted here: https://reviews.apache.org/r/8554/. It should also be CCing all traffic to the hdfs-dev mailing list, I believe.

          Colin Patrick McCabe added a comment -

          address todd's comments (see reviewboard)

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12560908/HDFS-347.026.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3662//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3662//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          This doesn't address all the points in the reviewboard (still working on another rev which does). However, it does have the path security validation, the addition of dfs.client.domain.socket.data.traffic, some refactoring of BlockReaderFactory and the addition of DomainSocketFactory, and the renaming of getBindPath to getBoundPath.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561092/HDFS-347.027.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          -1 javac. The applied patch generated 2013 javac compiler warnings (more than the trunk's current 2012 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3669//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3669//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3669//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          fixes:

          • put DomainSockets back in the cache right after constructing a BlockReaderLocal, rather than holding them for the lifetime of the reader (see the sketch after this list).
          • disable short-circuit local reads for blocks under construction
          • add back support for old path-based RPC
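
          A rough sketch of the first fix above (hypothetical names; the patch's actual cache and reader APIs may differ): the socket goes back into the client-side cache as soon as the block's file descriptors have been received, rather than being held for the reader's whole lifetime:

            import java.io.FileDescriptor;
            import java.io.IOException;

            // Hypothetical sketch, not code from the patch.
            class ShortCircuitOpenSketch {
              interface CachedSocket { }
              interface SocketCache {
                CachedSocket getOrCreate(String datanode) throws IOException;
                void release(CachedSocket sock);
              }

              private final SocketCache socketCache;

              ShortCircuitOpenSketch(SocketCache socketCache) { this.socketCache = socketCache; }

              FileDescriptor[] openBlockFds(String datanode, long blockId) throws IOException {
                CachedSocket sock = socketCache.getOrCreate(datanode);
                try {
                  return requestShortCircuitFds(sock, blockId);  // block + metadata descriptors
                } finally {
                  socketCache.release(sock);  // reusable right away; BlockReaderLocal keeps only the fds
                }
              }

              private FileDescriptor[] requestShortCircuitFds(CachedSocket sock, long blockId)
                  throws IOException {
                return new FileDescriptor[2];  // placeholder for the RequestShortCircuitFd exchange
              }
            }
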
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561160/HDFS-347.029.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          -1 javac. The applied patch generated 2013 javac compiler warnings (more than the trunk's current 2012 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3671//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3671//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3671//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          I did some manual testing and found that it worked with Kerberos enabled.

          I also found that it is competitive with the old local block reader implementation on my test.
          My test was catting a 1G file 7 times from FsShell.

          Numbers for the old local reads implementation:

          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          4.34user 1.30system 0:04.27elapsed 132%CPU (0avgtext+0avgdata 418592maxresident)k
          0inputs+88outputs (0major+25448minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          4.34user 1.27system 0:04.28elapsed 131%CPU (0avgtext+0avgdata 419456maxresident)k
          0inputs+72outputs (0major+24315minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          4.51user 1.29system 0:04.31elapsed 134%CPU (0avgtext+0avgdata 450320maxresident)k
          0inputs+72outputs (0major+25563minor)pagefaults 0swaps
          

          New implementation:

          cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'                     
          4.35user 1.28system 0:04.43elapsed 127%CPU (0avgtext+0avgdata 421520maxresident)k
          0inputs+72outputs (0major+25717minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          4.28user 1.24system 0:04.41elapsed 125%CPU (0avgtext+0avgdata 424480maxresident)k
          0inputs+72outputs (0major+24634minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          4.36user 1.30system 0:04.51elapsed 125%CPU (0avgtext+0avgdata 453280maxresident)k
          0inputs+80outputs (0major+25360minor)pagefaults 0swaps
          

          No local reads:

          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          6.01user 3.15system 0:08.06elapsed 113%CPU (0avgtext+0avgdata 434000maxresident)k
          0inputs+64outputs (0major+25949minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          5.24user 3.15system 0:07.09elapsed 118%CPU (0avgtext+0avgdata 443088maxresident)k
          0inputs+64outputs (0major+24773minor)pagefaults 0swaps
          cmccabe@keter:/h> /usr/bin/time bash -c './bin/hadoop fs -cat /zero /zero /zero /zero /zero /zero /zero &> /dev/null'
          5.32user 3.13system 0:07.16elapsed 118%CPU (0avgtext+0avgdata 445472maxresident)k
          0inputs+64outputs (0major+24819minor)pagefaults 0swaps
          
          Colin Patrick McCabe added a comment -

          this revision fixes an issue with the socket path permission checking

          Allen Wittenauer added a comment -

          While I appreciate that you did a check to make sure it 'works' with Kerberos, it'd be good to verify that we can't read data blocks where we lack the privileges to do so. (Yes, I'm paranoid, but I'm concerned about the case where we start blindly asking the datanodes for blocks where we bypass the NN completely.)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561359/HDFS-347.030.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 14 new or modified test files.

          -1 javac. The applied patch generated 2013 javac compiler warnings (more than the trunk's current 2012 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.net.unix.TestDomainSocket
          org.apache.hadoop.ha.TestZKFailoverController
          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3674//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3674//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3674//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          This is an updated patch which gives you some idea of how we can eliminate the Socket dependency as we talked about earlier. The JNI part still needs some work, mostly moving a few things around.

          Colin Patrick McCabe added a comment -

          Allen said:

          it'd be good to verify that we can't read data blocks where we lack the privileges to do so.

          Sure. We'll test this case too.

          Liang Xie added a comment -

          I'm a little confused; could anybody explain what the benefit of this issue is against HDFS-2246? IMHO, MR tasks with security enabled should be in. Thanks

          Colin Patrick McCabe added a comment -

          I'm a little confused; could anybody explain what the benefit of this issue is against HDFS-2246? IMHO, MR tasks with security enabled should be in. Thanks

          https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13169578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13169578

          Colin Patrick McCabe added a comment -

          bugfixes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12562620/HDFS-347.035.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 19 new or modified test files.

          -1 javac. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings).

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestFsck
          org.apache.hadoop.hdfs.server.datanode.TestDatanodeJsp

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3706//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3706//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3706//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3706//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          For those following along, my review comments on patch 035 are at: https://reviews.apache.org/r/8554/

          In addition to the failed tests, Colin, can you please take a look at the javac/javadoc warnings above?

          Colin Patrick McCabe added a comment -

          Thanks for the reviews on ReviewBoard, Todd. I have split this JIRA into four subtasks. If anyone is interested in hearing more about this issue, please watch HDFS-4352, HDFS-4353, HDFS-4354, and HDFS-4356.

          Suresh Srinivas added a comment -

          I posted this comment in a subtask. Can this jira work be done in a branch, instead of trunk?

          Todd Lipcon added a comment -

          Sure. Created a branch. I anticipate having all the work committed in the next day or two and will call a merge immediately. Keep in mind this work has been under review here for 2-3 months now, and there are 100+ watchers on this JIRA, so I don't anticipate needing a lengthy review period like we did for other branches.

          Suresh Srinivas added a comment -

          Keep in mind this work has been under review here for 2-3 months now, and there are 100+ watchers on this JIRA, so I don't anticipate needing a lengthy review period like we did for other branches.

          Just because there are 100+ watchers, I do not see either design or review comments from more than a handful of people. I plan to review this in a timely manner. But if it requires time, I expect that such time should be given, instead of hurrying the reviewer.

          Todd Lipcon added a comment -

          I believe all of the component pieces have now been committed to the HDFS-347 branch. I ran a number of benchmarks yesterday on the branch in progress, and just re-confirmed the results from the code committed in SVN. Here's a report of the benchmarks and results:

          Benchmarks

          To validate the branch, I ran a series of before/after benchmarks, specifically focused on random read. In particular, I ran benchmarks based on TestParallelRead, which has different variants that run the same workload through the different read paths.

          On the trunk ("before") branch, I ran TestParallelRead (normal read path) and TestParallelLocalRead (read path based on HDFS-2246). On the HDFS-347 branch, I ran TestParallelRead (normal read path) and TestParallelShortCircuitRead (new short-circuit path).

          I made the following modifications to the test cases to act as a better benchmark:

          1) Modified to 0% PROPORTION_NON_READ:

          Without this modification, I found that both the 'before' and 'after' tests became lock-bound, since the 'seek-and-read' workload holds a lock on the DFSInputStream. So, this obscured the actual performance differences between the data paths.

          2) Modified to 30,000 iterations

          Simply jacked up the number of iterations to get more reproducible results and to ensure that the JIT had plenty of time to kick in (the benchmarks ran for ~50 seconds each with this change instead of only ~5 seconds).

          3) Added a variation which has two target blocks

          I had a thought that there could potentially be a regression for workloads which frequently switch back and forth between two different blocks of the same file. This variation is the same test, but with the DFS Block Size set to 128KB, so that the 256KB test file is split into two equal sized blocks. This causes a good percentage of the random reads to span block boundaries, and ensures that the various caches in the code work OK even when moving between different blocks.
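
          To make the boundary-crossing point concrete, here is a tiny illustrative calculation (not test code, just the arithmetic the variation relies on):

          // Illustrative arithmetic only: with a 128KB block size and a 256KB file,
          // a 16KB positional read starting at offset 120KB touches both blocks.
          final int BLOCK_SIZE = 128 * 1024;
          int offset = 120 * 1024;
          int length = 16 * 1024;
          int firstBlock = offset / BLOCK_SIZE;                 // 0
          int lastBlock  = (offset + length - 1) / BLOCK_SIZE;  // 1 -> the read spans the block boundary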

          Comparing non-local read

          When the new code path is disabled, or when the DN is not local, we continue to use the existing code path. We expect that this code path's performance should be unaffected.

          Results:

          Test                        #Threads  #Files  Trunk MB/sec  HDFS-347 MB/sec
          TestParallelRead                   4       1         428.4            423.0
          TestParallelRead                  16       1         669.5            651.1
          TestParallelRead                   8       2         603.4            582.7
          TestParallelRead 2-blocks          4       1         354.0            345.9
          TestParallelRead 2-blocks         16       1         534.9            520.0
          TestParallelRead 2-blocks          8       2         483.1            460.8

          The above numbers seem to show a 2-4% regression, but I think it's within the noise on my machine (other software was running, etc). Colin also has one or two ideas for micro-optimizations which might win back a couple percent here and there, if it's not just noise.

          To put this in perspective, here are results for the same test against branch-1:

          Test                        #Threads  #Files  Branch-1 MB/sec
          TestParallelRead                   4       1            229.7
          TestParallelRead                  16       1            264.4
          TestParallelRead                   8       2            260.1

          (so trunk is 2-3x as fast as branch-1)

          Comparing local read

          Here we expect the performance to be as good or better than the old (HDFS-2246) implementation. Results:

          Test                              #Threads  #Files  Trunk MB/sec  HDFS-347 MB/sec
          TestParallelLocalRead                    4       1         901.4           1033.6
          TestParallelLocalRead                   16       1        1079.8           1203.9
          TestParallelLocalRead                    8       2        1087.4           1274.0
          TestParallelLocalRead 2-blocks           4       1         856.6            919.2
          TestParallelLocalRead 2-blocks          16       1        1045.8           1137.0
          TestParallelLocalRead 2-blocks           8       2         966.7           1392.9

          The result shows that the new implementation is indeed between 10% and 44% faster than the HDFS-2246 implementation. We're theorizing that the reason is that the old implementation would cache block paths, but not open file descriptors. So, because every positional read creates a new BlockReader, it would have to issue new open() syscalls, even if the location was cached.
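
          To make the descriptor-caching theory concrete, here is a minimal, hypothetical sketch of the difference; the class and method names are invented for illustration and this is not the actual BlockReader code:

          import java.io.FileInputStream;
          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;

          class BlockDescriptorCache {
            private final Map<Long, FileInputStream> openBlocks =
                new HashMap<Long, FileInputStream>();

            // Old behaviour as described above: the path is cached, but every positional
            // read re-opens the block file, i.e. one open() syscall per read.
            int readWithPathCache(String cachedPath, long offset, byte[] buf) throws IOException {
              FileInputStream in = new FileInputStream(cachedPath);
              try {
                in.getChannel().position(offset);
                return in.read(buf);
              } finally {
                in.close();
              }
            }

            // New behaviour: the open descriptor itself is cached and reused, so repeated
            // positional reads on the same block skip the extra open() calls.
            synchronized int readWithFdCache(long blockId, String path, long offset, byte[] buf)
                throws IOException {
              FileInputStream in = openBlocks.get(blockId);
              if (in == null) {
                in = new FileInputStream(path);
                openBlocks.put(blockId, in);
              }
              in.getChannel().position(offset);
              return in.read(buf);
            }
          }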

          Comparing sequential read

          I used the BenchmarkThroughput tool, configured to write a 1GB file, and then read it back 100 times. This ensures that it's in buffer cache, so that we're benchmarking CPU overhead (since the actual disk access didn't change in the patch, and we're looking for a potential regression in CPU resource usage). I recorded the MB/sec rate for the short-circuit before and short-circuit after, and then loaded the data into R and ran a T-test:

          > d.before <- read.table(file="/tmp/before-patch.txt")
          > d.after <- read.table(file="/tmp/after-patch.txt")
          > t.test(d.before, d.after)
          
                  Welch Two Sample t-test
          
          data:  d.before and d.after 
          t = 0.5936, df = 199.777, p-value = 0.5535
          alternative hypothesis: true difference in means is not equal to 0 
          95 percent confidence interval:
           -62.39975 116.14431 
          sample estimates:
          mean of x mean of y 
           2939.456  2912.584 
          

          The p-value 0.55 means that there's no statistically significant difference in the performance of the two data paths for sequential workloads.

          I did the same thing with short-circuit disabled and got the following t-test results for the RemoteBlockReader code path:

          > d.before <- read.table(file="/tmp/before-patch-rbr.txt")
          > d.after <- read.table(file="/tmp/after-patch-rbr.txt")
          > t.test(d.before, d.after)
          
                  Welch Two Sample t-test
          
          data:  d.before and d.after 
          t = 1.155, df = 199.89, p-value = 0.2495
          alternative hypothesis: true difference in means is not equal to 0 
          95 percent confidence interval:
           -18.69172  71.54320 
          sample estimates:
          mean of x mean of y 
           1454.653  1428.228 
          

          Again, the p-value 0.25 means there's no significant difference in performance.

          Summary

          The patch provides a good speedup (up to 40% in one case) for some random read workloads, and has no discernible negative impact on others.

          Liang Xie added a comment -

          The "10% and 44%" faster results are impressive. I just have a trivial question: will it be merged to branch-2 as well?

          Todd Lipcon added a comment -

          Attached a consolidated patch from trunk to the branch (git diff from 8360a7a6a4497c47cf6a389a2663a4a2b4867a19..681737e78ba0ce574b92ff0ef3bd1794492af27e). (The actual merge will be done with an svn merge command, rather than applying this patch – this is just to get a full Jenkins run).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12564780/hdfs-347-merge.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 3 release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3838//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3838//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3838//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3838//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          I just have a trivial question: will it be merged to branch-2 as well?

          The initial plan is to merge into trunk; branch-2 will come later.

          Todd Lipcon added a comment -

          New merge patch for Jenkins, which incorporates Colin's fix for the findbugs warning as well as a couple of other miscellanea. I also fixed the RAT exclude list to ignore CHANGES.HDFS-347.txt.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12564825/hdfs-347-merge.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 2 release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3840//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3840//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3840//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The two release audit warnings are unrelated. See my comment on HADOOP-9097.

          Brandon Li added a comment -

          @Todd, I tried to run the benchmark on my local machine. For TestParallelRead, I didn't see a very noticeable regression between HDFS-347 and trunk, which is good.

          How did you run TestParallelLocalRead? I simply kept the original TestParallelLocalRead.java (it's deleted in the merge patch), but it doesn't seem to give bigger throughput than TestParallelRead with HDFS-347. I think I missed something here.

          Colin Patrick McCabe added a comment -

          Hi Brandon,

          TestParallelLocalRead was renamed to TestParallelShortCircuitRead. The original version is not going to work with this branch because it doesn't set the correct configuration keys. It will fall back on the standard read path.

          I think "test local read" was a very unclear name, because all of the TestParallel functions are testing local reads (we are on a MiniDFSCluster, after all.) It's the fact that we are testing short-circuit reads which is important.

          HDFS-347 also adds a few tests which have no equivalent in trunk, like TestParallelShortCircuitReadNoChecksum and TestParallelUnixDomainRead.
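
          As a rough illustration, the short-circuit variants need something along the following lines in their Configuration before the MiniDFSCluster starts. The key names come from the configuration discussed later in this thread; the socket path value is just an example, and the exact keys each test variant sets may differ:

          // Sketch only, not the actual test code.
          Configuration conf = new Configuration();   // org.apache.hadoop.conf.Configuration
          conf.setBoolean("dfs.client.read.shortcircuit", true);
          conf.set("dfs.domain.socket.path", "/tmp/hdfs-test-socket._PORT");  // example path only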

          Todd Lipcon added a comment -

          Attaching another merge patch for Jenkins to run upstream, since both trunk and the branch have had a few changes since the last QA bot run.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566173/hdfs-347-merge.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestPeerCache

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3874//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3874//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Sure. Created a branch. I anticipate having all the work committed in the next day or two and will call a merge immediately. Keep in mind this work has been under review here for 2-3 months now, and there are 100+ watchers on this JIRA, so I don't anticipate needing a lengthy review period like we did for other branches.

          Todd, from your 9/Jan/2013 comment above, it seems that you have underestimated the work here. I think we should not underestimate the review time again.

          Colin Patrick McCabe added a comment -

          Hi Nicholas,

          I was planning on calling a merge vote this week. However, if you want more time, that's fine too.

          How much time do you think you'll need to review this?

          Tsz Wo Nicholas Sze added a comment -

          Could you post a patch first? The latest patch posted here still has a test failure and a javadoc warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566804/full.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 22 new or modified test files.

          -1 javac. The applied patch generated 2021 javac compiler warnings (more than the trunk's current 2013 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy:

          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3889//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3889//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3889//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          Looks like all the "new compiler warnings" came from this:

          0a1,8
          > [WARNING] 
          > [WARNING] Some problems were encountered while building the effective model for org.apache.hadoop:hadoop-common:jar:3.0.0-SNAPSHOT
          > [WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.apache.maven.plugins:maven-surefire-plugin @ line 484, column 15
          > [WARNING] 
          > [WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
          > [WARNING] 
          > [WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
          > [WARNING] 
          

          It seems that this is a pom.xml problem that was fixed in trunk by HADOOP-9242, but not in this branch.

          As for TestBalancerWithNodeGroup, it's known to be a flaky test-- see HDFS-4376 and HDFS-4261. This branch doesn't change the balancer at all.

          Tsz Wo Nicholas Sze added a comment -
          //DomainSocket.java
          +  static {
          +    if (SystemUtils.IS_OS_WINDOWS) {
          +      loadingFailureReason = "UNIX Domain sockets are not available on Windows.";
          +    } else if (!NativeCodeLoader.isNativeCodeLoaded()) {
          +      loadingFailureReason = "libhadoop cannot be loaded.";
          +    } else {
          +      String problem = "DomainSocket#anchorNative got error: unknown";
          +      try {
          +        anchorNative();
          +        problem = null;
          +      } catch (Throwable t) {
          +        problem = "DomainSocket#anchorNative got error: " + t.getMessage();
          +      }
          +      loadingFailureReason = problem;
          +    }
          +  }
          

          In the code above, when would "DomainSocket#anchorNative got error: unknown" be used?

          BTW, do you have a design doc somewhere?

          Tsz Wo Nicholas Sze added a comment -

          It looks like the patch has a lot of unrelated code/changes. It seems that the branch has not been merged with the latest trunk.

          Colin Patrick McCabe added a comment -

          update design document

          Colin Patrick McCabe added a comment -

          In the code above, when would "DomainSocket#anchorNative got error: unknown" be used?

          it's not used; this assignment could be removed.
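
          For reference, a minimal sketch of the same static initializer with the unused initial value removed; this is just an illustration of the suggestion, not the committed code:

          static {
            String reason;
            if (SystemUtils.IS_OS_WINDOWS) {
              reason = "UNIX Domain sockets are not available on Windows.";
            } else if (!NativeCodeLoader.isNativeCodeLoaded()) {
              reason = "libhadoop cannot be loaded.";
            } else {
              try {
                anchorNative();
                reason = null;   // native hooks loaded successfully
              } catch (Throwable t) {
                reason = "DomainSocket#anchorNative got error: " + t.getMessage();
              }
            }
            loadingFailureReason = reason;
          }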

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566903/2013.01.28.design.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3900//console

          This message is automatically generated.

          Brandon Li added a comment -

          I did some tests comparing the read performance with and without unix domain socket enabled. The result is not what I expected.

          1. Apply the latest patch to trunk and start a one-datanode cluster on a Linux box (Linux version 2.6.18-238, 15GB memory).
          2. copy a 1GB local file to HDFS.
          3. read it back 4 times (copyToLocal) without enabling unix domain socket support, and the results:

                   read 1    read 2    read 3    read 4
          real   0m6.655s  0m7.080s  0m6.680s  0m6.680s
          user   0m4.340s  0m4.533s  0m4.459s  0m4.563s
          sys    0m3.544s  0m3.525s  0m3.484s  0m3.438s

          4. add the following to hdfs-site.xml
          <property>
          <name>dfs.domain.socket.path</name>
          <value>/grid/0/test/unixsocket</value>
          </property>
          <property>
          <name>dfs.client.domain.socket.data.traffic</name>
          <value>true</value>
          </property>
          5. restart the cluster, format namenode, copy 1GB file, read it back 4 times. The new results:

                   read 1    read 2    read 3    read 4
          real   0m8.296s  0m7.811s  0m7.960s  0m7.803s
          user   0m5.129s  0m5.172s  0m5.197s  0m5.018s
          sys    0m3.643s  0m3.572s  0m3.701s  0m3.910s
          Colin Patrick McCabe added a comment -

          Hi Brandon,

          What you are testing here is not short-circuit reads, but passing data traffic over UNIX domain sockets, a configuration we don't recommend. (See Todd's and my comments about this earlier in this JIRA.)

          If you want to test short-circuit local reads, please set this configuration:

          <property>
            <name>dfs.client.domain.socket.data.traffic</name>
            <value>false</value>
          </property>
          <property>
            <name>dfs.client.read.shortcircuit</name>
            <value>true</value>
          </property>
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/var/run/hdfs/sock._PORT</value>
          </property>
          
          Brandon Li added a comment -

          Now it looks better with the new configuration:

                   read 1    read 2    read 3    read 4
          real   0m5.605s  0m5.685s  0m5.524s  0m5.738s
          user   0m3.663s  0m3.639s  0m3.641s  0m3.589s
          sys    0m2.924s  0m2.895s  0m2.871s  0m3.014s

          It would be nice to have a doc describing how to use this feature correctly and its relationship with the previous short-circuit implementation.

          Colin Patrick McCabe added a comment -

          consolidated patch

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567408/2013.01.31.consolidated.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3919//console

          This message is automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567419/2013.01.31.consolidated2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 26 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3920//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3920//console

          This message is automatically generated.

          Liang Xie added a comment -

          Hi Colin Patrick McCabe, would you mind providing a patch against branch-2 if possible? It would be appreciated.
          I could volunteer to do a simple performance test on our HBase test cluster, which is built with branch-2, to see whether there is a performance improvement on the application side. Thanks in advance.

          Colin Patrick McCabe added a comment -

          Hi Liang,

          Thanks for offering to do some tests. Unfortunately, at this point, rebasing off of branch-2 would be a lot of work. Are you sure that you can't simply run against trunk?

          There should be no problem running against trunk with something like this:

          cd hbase
          git clean -fdqx
          mvn package -Dtar -Pnative -Dhadoop.profile=2.0 -Dhadoop.version=3.0.0-SNAPSHOT \
            -Dcdh.hadoop.version=3.0.0-SNAPSHOT -DskipTests -Dmaven.javadoc.skip=true
          

          Obviously, when you untar ./target/hbase-0.94.2-cdh4.3.0-SNAPSHOT.tar.gz (or whatever it ends up being called), you have to be sure to replace the 3.0.0-SNAPSHOT Hadoop jars it pulled from upstream with the hadoop jars you built yourself from HDFS-347. But aside from that, it should all just work.

          Suresh Srinivas added a comment -

          The latest merge patch 2013.01.31.consolidated2.patch may not be correct and might be missing trunk changes MAPREDUCE-4893 and MAPREDUCE-4929?

          Aaron T. Myers added a comment -

          Hey Suresh, I think that was just because trunk was last merged into the HDFS-347 branch by me on 1/30, whereas MAPREDUCE-4893 was only merged to trunk on 1/31, and MAPREDUCE-4929 was only committed to branch-1.

          Suresh Srinivas added a comment -

          MAPREDUCE-4929 was only committed to branch-1.

          Sorry I meant MR-4969

          In the current consolidated patch there may be extraneous changes. Updating it will help the review. See some of the changes in CHANGES.txt corresponding to common, hdfs and MR.

          Tsz Wo Nicholas Sze added a comment -

          The patches posted here are very hard to review. The main reasons are that the code is not well organized and is hard to read, such as the problem mentioned previously. The patch also uses a lot of short variable/method names and seems to contain unrelated code. Here are some comments so far for 2013.01.31.consolidated2.patch:

          • DomainSocket.java
            • do not use AtomicInteger for status, add a new class
            • rename fdRef(), fdUnref(boolean), jfds, jbuf, SND_BUF_SIZE, etc.
            • do not override finalize().
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569632/2012.02.15.consolidated3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 23 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3975//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3975//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569636/2013.02.15.consolidated4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3976//console

          This message is automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569639/2013.02.15.consolidated4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3977//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3977//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Took a quick look at 2013.02.15.consolidated4.patch. It still uses finalize() for cleanup. However, in Hadoop we use a shutdown hook for cleanup, not finalize.

          Tsz Wo Nicholas Sze added a comment -

          Some more comments:

          • I might be wrong: we may not need to change DataTransferProtocol since, if local read is enabled, the server may detect that the client is local and then set up the stream using file descriptors.
          • Remove DataNode.CURRENT_BLOCK_FORMAT_VERSION since we currently only have one version. We may add it in the future if necessary.
          • Why add FsDatasetSpi.getShortCircuitFdsForRead instead of using getBlockInputStream and getMetaDataInputStream?
          • The following is hard to understand:
            • Why using two String.format()?
            • Why using %s for string constants?
          +        BlockSender.ClientTraceLog.info(String.format(
          +          String.format(
          +            "src: %s, dest: %s, op: %s, blockid: %s, srvID: %s, " +
          +              "success: %b",
          +            "127.0.0.1",                   // src IP
          +            "127.0.0.1",                   // dst IP
          +            "REQUEST_SHORT_CIRCUIT_FDS",   // operation
          +            blk.getBlockId(),             // block id
          +            dnR.getStorageID(),
          +            (fis != null)
          +          )));
          
          Colin Patrick McCabe added a comment -
          • I agree, we should remove the finalize on DomainSocket.
          • Not all clients want to do short-circuit local reads. There is no way the server can detect this without the client explicitly asking.
          • I am not sure what you are suggesting by "Remove CURRENT_BLOCK_FORMAT_VERSION." Are you suggesting that we use a naked constant here? That seems to go against section 10.3 of the Java style guide. http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-137265.html#1255 Perhaps you are thinking that the constant should go somewhere else?
          • getMetaDataInputStream returns a LengthInputStream, but what we need is a FileInputStream. So I don't think that method could be used, without some refactoring.
          • I agree that using String.format twice is incorrect. We should fix this (see the sketch below).
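
          For illustration only (a sketch, not code from the patch), the same trace line can be built with a single String.format. The arguments mirror the snippet quoted above; the block id and storage ID values here are hypothetical stand-ins:

            // Standalone sketch: one String.format is enough to build the trace line.
            public class ClientTraceFormatSketch {
              public static void main(String[] args) {
                long blockId = 1073741825L;        // hypothetical block id
                String storageId = "DS-1234";      // hypothetical storage ID
                boolean success = true;            // stands in for (fis != null)
                String line = String.format(
                    "src: %s, dest: %s, op: %s, blockid: %s, srvID: %s, success: %b",
                    "127.0.0.1",                   // src IP
                    "127.0.0.1",                   // dst IP
                    "REQUEST_SHORT_CIRCUIT_FDS",   // operation
                    blockId, storageId, success);
                System.out.println(line);          // the patch would pass this to ClientTraceLog.info(...)
              }
            }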
          Todd Lipcon added a comment -

          It still use finalize() for cleanup. However, we use shutdown hook for cleanup in Hadoop but not finalize.

          I don't follow – using finalize to close a socket is exactly what it's meant for. Otherwise, we will leak a file descriptor if someone forgets to close() a DomainSocket. I don't see how it's at all related to the shutdown hook. Given that DomainSocket is the parallel of the JDK Socket implementation, we should follow their example:

              /**
               * Cleans up if the user forgets to close it.
               */
              protected void finalize() throws IOException {
                  close();
              }
          

          (from classes/java/net/AbstractPlainSocketImpl.java in the JDK7 source)

          Tsz Wo Nicholas Sze added a comment -

          > Not all clients want to do short-circuit local reads. ...

          Why some clients don't want to do short-circuit? Could you give an example?

          > I am not sure what you are suggesting by "Remove CURRENT_BLOCK_FORMAT_VERSION." ...

          I mean we might not need to compare versions since we currently only have one. If we add a new version in the future, the server could then detect the old clients and fail them. Sound good?

          > getMetaDataInputStream returns a LengthInputStream, but what we need is a FileInputStream. ...

          LengthInputStream is a FilterInputStream. It is easy to return the underlying input stream from a FilterInputStream.
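
          For illustration only, a standalone sketch of the kind of accessor being suggested: a FilterInputStream subclass that hands back its wrapped stream, which a caller could cast to FileInputStream when it knows that is what the dataset passed in. The class and method names here are hypothetical stand-ins, not the actual HDFS API:

            import java.io.FilterInputStream;
            import java.io.InputStream;

            // Hypothetical stand-in for LengthInputStream: wraps a stream and remembers its length.
            class LengthAwareInputStream extends FilterInputStream {
              private final long length;

              LengthAwareInputStream(InputStream in, long length) {
                super(in);
                this.length = length;
              }

              long getLength() {
                return length;
              }

              // The accessor under discussion: expose the wrapped stream.
              InputStream getWrappedStream() {
                return in;  // 'in' is the protected field inherited from FilterInputStream
              }
            }

          A caller that knows a FileInputStream was wrapped could then cast the result of getWrappedStream() instead of adding a new FsDatasetSpi method, which seems to be the refactoring being suggested here.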

          One more question:

          • How to configure the existing short-circuit read (HDFS-2246) after the patch?
          Tsz Wo Nicholas Sze added a comment -

          > ... Given that DomainSocket is the parallel of the JDK Socket implementation, we should follow their example: ...

          Do you mean java.net.Socket? I have checked a few versions of JDK and could not find finalize(). I may be missing something. Where did you find the source code?
          http://www.docjar.com/html/api/java/net/Socket.java.html
          http://javasourcecode.org/html/open-source/jdk/jdk-6u23/java/net/Socket.java.html

          Harsh J added a comment -

          Nicholas,

          Todd has mentioned the reference point in his post "(from classes/java/net/AbstractPlainSocketImpl.java in the JDK7 source)". This is viewable online at http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/acd5ac174459/src/share/classes/java/net/AbstractPlainSocketImpl.java.

          Harsh J added a comment -

          How to configure the existing short-circuit read (HDFS-2246) after the patch?

          There is add-on config required, see https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13565707&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13565707 (commented earlier here).

          Harsh J added a comment -

          (Ignore above - I think you were asking how to use the alternative/older way, my bad).

          Tsz Wo Nicholas Sze added a comment -

          For finalize(), it is currently only used in the tests, so I think we should remove them.

          > I don't follow – using finalize to close a socket is exactly what it's meant for. Otherwise, we will leak a file descriptor if someone forgets to close() a DomainSocket. I don't see how it's at all related to the shutdown hook. Given that DomainSocket is the parallel of the JDK Socket implementation, ...

          Todd, do you suggest that DomainSocket should override it? In the latest patch, it does not.

          Suresh Srinivas added a comment -

          How to configure the existing short-circuit read (HDFS-2246) after the patch?

          Colin can you please answer this?

          I was looking for relevant information from the design doc:

          We support backwards compatibility: clients using the existing short-circuit read implementation can talk
          with clients that use the new short-circuit read implementation. This is why the existing GetBlockLocations
          RPC is still supported.

          This does not read correctly. Clients talking to clients? Should one of them be server?

          Todd Lipcon added a comment -

          Todd, do you suggest that DomainSocket should override it? In the latest patch, it does not.

          Yes, it looks like Colin removed it on 2/16 following your review comments (a5a3d4147060a8b7e81c5d8050fb6393344a7d74). However, I don't think that was correct.

          As for the old HDFS-2246 path still being available: it's not available any more. The new way is strictly better. The old way encourages insecure setups and is slower, and I don't think we should maintain it. As I said when that method was implemented, I planned to -1 it if it were seen as the "final solution" instead of just a stopgap hack on the way to the correct solution (this JIRA). Now that we have the correct one, please let us kill off the old code.

          Colin Patrick McCabe added a comment -

          Why some clients don't want to do short-circuit? Could you give an example?

          When using short-circuit local reads, you don't get all of the metrics that you get with regular reads.

          LengthInputStream is a FilterInputStream. It is easy to return the underlying input strem from FilterInputStream.

          Can you be more specific about how you would like to do this?

          How to configure the existing short-circuit read (HDFS-2246) after the patch?

          On the DataNode side, the configuration parameters for old-style short-circuit local reads haven't changed. On the client side, using old-style short-circuit local reads is not possible. The server-side code is there only to provide backwards compatibility. In other words, it is there to provide interoperability between older clients and newer servers. We don't have to maintain it forever, but I think we at least want the backwards compatibility code in 2.0.x.

          I mean we might not need to compare version since we currently only has one. If we have a new version in the future, the server could then detect the old clients and fail them. Sound good?

          My fear is that if we don't think through the compatibility issues now, we'll have more bugs like HDFS-4506.

          Suresh Srinivas added a comment -

          I have already posted some comments on the merge vote. To make it very clear, I am -1 on removing HDFS-2246 as a part of this jira.

          In the future, the design document or description should explicitly state removal of functionality instead of leaving obscure comments.

          Colin Patrick McCabe added a comment -

          In future, design document or description should explicitly state removal of functionality instead of obscure comments.

          The design document posted on January 28th explicitly states that file paths are no longer going to be used.

          Section 3, part 3:

          3. Changes to BlockReaderLocal to allow it to make use of FileInputStreams rather than file paths.

          Suresh Srinivas added a comment -

          Changes to BlockReaderLocal to allow it to make use of FileInputStreams rather than file paths.

          I worked closely on getting the short-circuit patch into branch-1 and trunk. Reading the doc, I had a hard time translating this into "HDFS-2246 functionality will be removed."

          The explicit way to state this is to say, HDFS-2246 mechanism will be deprecated/removed.

          Colin Patrick McCabe added a comment -

          patch for jenkins

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12576416/2013-04-01-jenkins.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 23 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common:

          org.apache.hadoop.fs.TestFcHdfsSymlink

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4174//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4174//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          The build failure is https://issues.apache.org/jira/browse/HDFS-4653, introduced from trunk by the merge.

          Tsz Wo Nicholas Sze added a comment -

          Colin, FsDatasetSpi.getShortCircuitFdsForRead is still in the latest patch. I guess you have not yet addressed the previous comments.

          Suresh Srinivas added a comment -

          Colin, FsDatasetSpi.getShortCircuitFdsForRead is still in the latest patch. I guess you have not yet addressed the previous comments.

          Colin, can you respond to this, in this jira?

          Also does the design doc need to be updated?

          Colin Patrick McCabe added a comment -

          https://issues.apache.org/jira/browse/HDFS-4661 removes FsDatasetSpi.getShortCircuitFdsForRead as Nicholas suggested. It is not yet committed.

          The design doc should be current, except for the fact that it doesn't mention BlockReaderLocalLegacy.

          Tsz Wo Nicholas Sze added a comment -

          The patch (2013-04-01-jenkins.patch) again has some unrelated code. What is the intention of including the unrelated code? Even for Jenkins testing, the result is invalid with the unrelated code.

          Tsz Wo Nicholas Sze added a comment -
          //FileInputStreamCache.CacheCleaner
                  for (Iterator<Entry<Key, Value>> iter = map.entries().iterator();
                        iter.hasNext();
                        iter = map.entries().iterator()) {
                    Entry<Key, Value> entry = iter.next();
                    if (entry.getValue().getTime() + expiryTimeMs >= curTime) {
                      break;
                    }
                    entry.getValue().close();
                    iter.remove();
                  }
          
          • The above setting "iter = map.entries().iterator()" in each loop seems like a bug.
          • If (entry.getValue().getTime() + expiryTimeMs >= curTime), why break but not close?
          Tsz Wo Nicholas Sze added a comment -
          //FileInputStreamCache.Key.equals
                return (block.equals(otherKey.block) & 
                    (block.getGenerationStamp() == otherKey.block.getGenerationStamp()) &
                    datanodeID.equals(otherKey.datanodeID));
          
          • Why use & but not &&? Is it a typo?
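
          For illustration only, a standalone sketch of the same kind of comparison written with the short-circuit && operator; the fields are simplified stand-ins for the actual FileInputStreamCache.Key fields:

            import java.util.Objects;

            // Simplified stand-in for FileInputStreamCache.Key.
            class CacheKey {
              private final long blockId;
              private final long generationStamp;
              private final String datanodeId;

              CacheKey(long blockId, long generationStamp, String datanodeId) {
                this.blockId = blockId;
                this.generationStamp = generationStamp;
                this.datanodeId = datanodeId;
              }

              @Override
              public boolean equals(Object o) {
                if (!(o instanceof CacheKey)) {
                  return false;
                }
                CacheKey other = (CacheKey) o;
                // && short-circuits: the remaining comparisons are skipped once one fails.
                return blockId == other.blockId
                    && generationStamp == other.generationStamp
                    && datanodeId.equals(other.datanodeId);
              }

              @Override
              public int hashCode() {
                return Objects.hash(blockId, generationStamp, datanodeId);
              }
            }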
          Colin Patrick McCabe added a comment -

          The only purpose of the patch with 'Jenkins' in the name is to get a Jenkins run. It is not for review. It was generated with 'git diff master' run from the HDFS-347 branch. You could do a similar thing with subversion by checking out two copies and diffing them. I recommend looking at the commits in subversion or git.

          In FileInputStreamCache.CacheCleaner, we look at the first element in the cache every time. Basically, on every iteration through the loop, we consider the question of whether to delete the first entry, or exit the loop. There is no bug. Check the unit test TestFileInputStreamCache.
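
          For illustration only, a standalone sketch of the eviction pattern described above: keep looking at the oldest entry and either close-and-remove it or stop. It uses an insertion-ordered LinkedHashMap as a stand-in for the cache's map, so the details differ from the actual FileInputStreamCache code:

            import java.util.Iterator;
            import java.util.LinkedHashMap;
            import java.util.Map;

            public class CacheCleanerSketch {
              public static void main(String[] args) {
                // Insertion-ordered map: the oldest entry is always first.
                Map<String, Long> insertTimes = new LinkedHashMap<String, Long>();
                insertTimes.put("block-1", 0L);
                insertTimes.put("block-2", 5000L);
                insertTimes.put("block-3", 9000L);

                long expiryTimeMs = 6000L;
                long curTime = 10000L;

                // Re-examine the head of the map on every pass; stop at the first entry
                // that has not expired, since everything after it is newer.
                for (Iterator<Map.Entry<String, Long>> iter = insertTimes.entrySet().iterator();
                     iter.hasNext();
                     iter = insertTimes.entrySet().iterator()) {
                  Map.Entry<String, Long> oldest = iter.next();
                  if (oldest.getValue() + expiryTimeMs >= curTime) {
                    break;                         // the head has not expired yet
                  }
                  // The real code closes the cached FileInputStreams here.
                  System.out.println("evicting " + oldest.getKey());
                  iter.remove();
                }
                System.out.println("remaining: " + insertTimes.keySet());
              }
            }

          As noted further down in the thread, re-creating the iterator in the update clause could be dropped with the same result, since the head that was just removed is never revisited.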

          In FileInputStreamCache.Key#equals, I do agree that the short-circuit AND operation should be used in preference to the non-short-circuit. Mind adding this to HDFS-4661?

          Tsz Wo Nicholas Sze added a comment -

          > The only purpose of the patch with 'Jenkins' in the name is to get a Jenkins run. ...

          Even for Jenkins, it is invalid since the submitted patch contains unrelated code. We could not tell whether Jenkins would still give a +1 if the unrelated code were removed. Does that make sense?

          > In FileInputStreamCache.CacheCleaner, we look at the first element in the cache every time. ...

          The code "iter = map.entries().iterator()" can be removed with the same result since the (previous) first element must be removed.

          > If (entry.getValue().getTime() + expiryTimeMs >= curTime), why break but not close?

          Oops, break is correct here.

          Tsz Wo Nicholas Sze added a comment -

          Colin, I hope you understand that I also want this feature to get in ASAP. However, we first need to clean up the patch.

          Colin Patrick McCabe added a comment -

          Can you please add these style comments to the style cleanup JIRA HDFS-4661? Loading this JIRA makes my web browser slow to a crawl.

          Tsz Wo Nicholas Sze added a comment -

          > ... . It was generated with 'git diff master' run from the HDFS-347 branch. ...

          In case you have difficulty generating a patch without extra code, the steps below seem to work fine.

          1. Use git log to find the latest trunk commit merged to the branch. In this case, it is YARN-460.
          2. Switch to trunk and use git log to find the commit id for YARN-460 from trunk
            commit b6c6c66860fcc00f47049786bb7772f981faf100
            Author: Thomas Graves <tgraves@apache.org>
            Date:   Fri Mar 29 14:36:53 2013 +0000
            
                YARN-460. CS user left in list of active users for the queue even when application finished (tgraves)
                
                git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1462486 13f79535-47bb-0310-9956-ffa450edef68
            
          3. Switch back to the branch and then run git diff with the commit id, i.e.
            git diff b6c6c66860fcc00f47049786bb7772f981faf100
            
          Colin Patrick McCabe added a comment -

          The purpose of the patch with 'Jenkins' in the name is to get a Jenkins run. In order to do that, the patch needs to be such that applying it to a directory containing the trunk code results in exactly the code which is in the HDFS-347 branch.

          What you have suggested may be helpful for review. If it is, feel free to use it locally, since you clearly know how to generate it. But there is no need to post it here. And if you do post it, it will just result in Jenkins spitting out a "build failure" message. If you don't believe me, try it yourself.

          Tsz Wo Nicholas Sze added a comment -

          Colin, as mentioned previously, patches generated with extra code are not helpful for Jenkins runs. Jenkins should not run with extra code. Hope you understand.

          Tsz Wo Nicholas Sze added a comment -

          Sure, let's try.

          Colin Patrick McCabe added a comment -

          The difference between my patch and yours is this.

          diff '--exclude=.git' -r hadoop1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java hadoop2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ ApplicationConstants.java
          45,50d44
          <   
          <   /**
          <    * The environment variable for APPLICATION_ATTEMPT_ID. Set in AppMaster
          <    * environment only
          <    */
          <   public static final String AM_APP_ATTEMPT_ID_ENV = "AM_APP_ATTEMPT_ID";
          diff '--exclude=.git' -r hadoop1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java hadoop2/      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java
          60,63c60,64
          <  * which it spawns the AM in another process and passes it the attempt id via
          <  * env variable ApplicationConstants.AM_APP_ATTEMPT_ID_ENV. The AM can be in any
          <  * language. The AM can register with the RM using the attempt id and proceed as
          <  * normal. The client redirects app stdout and stderr to its own stdout and
          ---
          >  * which it spawns the AM in another process and passes it the container id via
          >  * env variable ApplicationConstants.AM_CONTAINER_ID_ENV. The AM can be in any
          >  * language. The AM can register with the RM using the attempt id obtained
          >  * from the container id and proceed as normal.
          >  * The client redirects app stdout and stderr to its own stdout and
          diff '--exclude=.git' -r hadoop1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerResponsePBImpl.java hadoop2/hadoop-yarn-     project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerResponsePBImpl.java
          100a101
          >     rebuild = true;
          117c118,119
          <       return;
          ---
          >     } else {
          >       builder.setNodeAction(convertToProtoFormat(nodeAction));
          119c121
          <     builder.setNodeAction(convertToProtoFormat(nodeAction));
          ---
          >     rebuild = true;
          Only in hadoop2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords: TestRegisterNodeManagerResponse.java
          

          Another way of looking at it is that in a few cases, after my patch is applied, I have the "older" versions of certain things, and you do not. In this case, it's harmless. However, if someone had made a conflicting change in trunk since the last integrate-from-trunk, your patch would have broken. This happened several times in the lifetime of the HDFS-347 branch. One example is HDFS-4595. So I do think you will pass Jenkins, but only accidentally.

          It would be nice if we set up Jenkins so that it could build branches. If we did this, what it would be testing is effectively the patch I posted-- not the one you did.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12577111/a.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 23 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4187//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4187//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -
          • A few hours earlier, Colin said,

            What you have suggested may be helpful for review. If it is, feel free to use it locally, since you clearly know how to generate it. But there is no need to post it here. And if you do post it, it will just result in Jenkins spitting out a "build failure" message. If you don't believe me, try it yourself.

          • Colin's previous comment:

            Another way of looking at it is that in a few cases, after my patch is applied, I have the "older" versions of certain things, and you do not. In this case, it's harmless. However, if someone had made a conflicting change in trunk since the last integrate-from-trunk, your patch would have broken. This happened several times in the lifetime of the HDFS-347 branch. One example is HDFS-4595. So I do think you will pass Jenkins, but only accidentally.

            It would be nice if we set up Jenkins so that it could build branches. If we did this, what it would be testing is effectively the patch I posted-- not the one you did.

          Colin, you are amazing! A few hours earlier you said that the build would fail. I didn't believe you and gave it a try. Then the build succeeded, and then you could explain it.

          Suresh Srinivas added a comment -

          I have refrained from commenting. What I have seen is a lack of understanding, a lack of receptiveness to suggestions, and complete disregard for the comments posted by a long-term committer.

          The only purpose of the patch with 'Jenkins' in the name is to get a Jenkins run. It is not for review. It was generated with 'git diff master' run from the HDFS-347 branch. You could do a similar thing with subversion by checking out two copies and diffing them. I recommend looking at the commits in subversion or git.

          What is the basis of this statement? I mentioned several times in the voting thread that I chose to review the merge patch, along with several reasons why that works for me. Still you make the statement that "The only purpose of the patch with 'Jenkins' in the name is to get a Jenkins run".

          Repeated requests for generating a clean patch have been ignored.

          Yes, you need to generate the right patch, not just a patch that is close to the right patch. If the patch is not the right one, a Jenkins +1 has no meaning. Many comments have tried to indicate this, several times here, here and here. If this is not clear, you could ask others to guide you on how to do it (though the helpful tips on how to do it have so far been ignored).

          Can you please add these style comments to the style cleanup JIRA HDFS-4661? Loading this JIRA slows my web browser to a crawl.

          I can load this on my mobile device over a wireless connection with no problem. If the comments are related to the clean merge patch, they should be made in this JIRA, shouldn't they?

          The difference between my patch and yours is this...

          After the way to do it was described, you chose to ignore it. Now that it has been done and works, you seem to be saying it does not count because the difference is not significant? It does not matter whether it is a few lines of difference or one line; the right merge patch needs to be generated.

          In the interest of making progress on this issue, Todd Lipcon, can you please help merge the patch with a cleanly and correctly generated merge patch? If you are busy, I will work with Tsz Wo Nicholas Sze on getting the patch merged to trunk. Tsz Wo Nicholas Sze, can you please indicate whether a +1 from Jenkins on a clean merge patch is sufficient to merge this change to trunk, or whether you would like to see any more changes?
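
          The merge-diff workflow quoted above ('git diff master' from the branch, or diffing two subversion checkouts) can be illustrated with a rough sketch. The branch name, repository URLs, and output file names below are assumptions for illustration only, not the exact commands that were run for the HDFS-347 branch:

            # git: from a working copy of the feature branch, diff against trunk/master
            git checkout HDFS-347
            git diff master > HDFS-347-merge.patch

            # subversion: check out two copies and diff them, skipping svn metadata
            svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk trunk
            svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/HDFS-347 HDFS-347
            diff -ru --exclude=.svn trunk HDFS-347 > HDFS-347-merge.patch

          Either sketch produces a single diff of the whole branch against trunk, which is the kind of consolidated patch being discussed in this thread.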

          Colin Patrick McCabe added a comment -

          Hi Suresh,

          I am sorry if I caused you any offense. I view this as a pretty minor disagreement; hopefully you agree.

          If this is the way we want to do Jenkins runs on branches going forward, let's get consensus from everyone and document that here:

          http://wiki.apache.org/hadoop/HowToContribute

          Right now I don't see anything there about getting Jenkins runs on branches. If there is another reference, let me know; otherwise, I think we need more clarity.

          Aaron T. Myers added a comment -

          Since the merge vote passed, I have merged the branch to trunk. Leaving the JIRA open for now until we also do the merge to branch-2.

          Colin, thanks a ton for the monster contribution. This has been a long time coming.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3612 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3612/)
          HDFS-347. DFS read performance suboptimal when client co-located on nodes with data. Contributed by Colin Patrick McCabe. (Revision 1467538)

          Result = SUCCESS
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467538
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/pom.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/CMakeLists.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketInputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketOutputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix/DomainSocket.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix/DomainSocket.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org_apache_hadoop.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TemporarySocketDirectory.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TestDomainSocket.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocalLegacy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DomainSocketFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/FileInputStreamCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/PeerCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/SocketCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/BasicInetPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/EncryptedPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/NioInetPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/Peer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/PeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/TcpPeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/DataTransferProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Op.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Receiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Sender.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/JspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/ClientDatanodeProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/datatransfer.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/ShortCircuitLocalReads.apt.vm
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/BlockReaderTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestBlockReaderLocal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestBlockReaderLocalLegacy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientBlockVerification.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestConnCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDataTransferKeepalive.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDisableConnCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileInputStreamCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelLocalRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelReadUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitLegacyRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitReadNoChecksum.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitReadUnCached.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelUnixDomainRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPeerCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestShortCircuitLocalRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSocketCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockTokenWithDFS.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          Tsz Wo Nicholas Sze added a comment -

          Merge without completing all the sub-tasks?

          Aaron T. Myers added a comment -

          Nicholas,

          There is quite a bit of precedent for merging a branch to trunk before all of the sub-tasks are completed, if the branch as-is is functional and won't break trunk and those sub-tasks can reasonably be continued on trunk, e.g. HDFS-1623, HDFS-3077, and HDFS-1073. Given that Suresh had suggested, in response to your -1 on the merge vote, that we continue addressing this feedback on trunk, and given that you then removed your -1 and the vote passed, it seems to me that this branch was ready for a merge to trunk.

          Yes, there are a few more small cleanup sub-tasks to work on, but that can be done on trunk. Let's continue the work there.

          Suresh Srinivas added a comment -

          I think as long as the main issues are done and the remaining issues do not render trunk non-functional, I am fine with merging it. Let's get all the related sub-tasks resolved before merging this into branch-2.

          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #182 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/182/)
          HDFS-347. DFS read performance suboptimal when client co-located on nodes with data. Contributed by Colin Patrick McCabe. (Revision 1467538)

          Result = SUCCESS
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467538
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/pom.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/CMakeLists.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketInputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketOutputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix/DomainSocket.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix/DomainSocket.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org_apache_hadoop.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TemporarySocketDirectory.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TestDomainSocket.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocalLegacy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DomainSocketFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/FileInputStreamCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/PeerCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/SocketCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/BasicInetPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/EncryptedPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/NioInetPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/Peer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/PeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/TcpPeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/DataTransferProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Op.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Receiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Sender.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/JspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/ClientDatanodeProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/datatransfer.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/ShortCircuitLocalReads.apt.vm
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/BlockReaderTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestBlockReaderLocal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestBlockReaderLocalLegacy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientBlockVerification.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestConnCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDataTransferKeepalive.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDisableConnCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileInputStreamCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelLocalRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelReadUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitLegacyRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitReadNoChecksum.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelShortCircuitReadUnCached.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestParallelUnixDomainRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPeerCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestShortCircuitLocalRead.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSocketCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockTokenWithDFS.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1371 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1371/)
          HDFS-347. DFS read performance suboptimal when client co-located on nodes with data. Contributed by Colin Patrick McCabe. (Revision 1467538)

          Result = FAILURE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467538
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/pom.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/CMakeLists.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketInputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketOutputStream.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/unix/DomainSocket.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix/DomainSocket.c
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org_apache_hadoop.h
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TemporarySocketDirectory.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/unix/TestDomainSocket.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocalLegacy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DomainSocketFactory.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/FileInputStreamCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/PeerCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/SocketCache.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/BasicInetPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DomainPeerServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/EncryptedPeer.java