Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-347

DFS read performance suboptimal when client co-located on nodes with data

    XMLWordPrintableJSON

Details

    • Reviewed

    Description

      One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.

      After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running in a separate JVM) running on the same machine. The DataNode will satisfy the single disk block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write(). In either case, each chunk is read from the disk by issuing a separahitting te I/O operation for each chunk. The result is that the single request for a 64-MB block ends up the disk as over a thousand smaller requests for 64-KB each.

      Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, the 64-kb chunk turns into a large number of 1500 byte packet send operations). Thus we see a large number of context switches for each block send operation.

      I'd like to get some feedback on the best way to address this, but I think providing a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode, marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same hadoop configuration, the DFSClient should be able to find the files holding the block data, and it could directly open them and send data back to the client. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64KB, which should reduce the number of iops imposed by each read block operation.

      Attachments

        1. a.patch
          413 kB
          Tsz-wo Sze
        2. 2013-04-01-jenkins.patch
          424 kB
          Colin McCabe
        3. 2013.02.15.consolidated4.patch
          366 kB
          Colin McCabe
        4. 2013.01.31.consolidated2.patch
          402 kB
          Colin McCabe
        5. 2013.01.31.consolidated.patch
          402 kB
          Colin McCabe
        6. 2013.01.28.design.pdf
          72 kB
          Colin McCabe
        7. full.patch
          379 kB
          Colin McCabe
        8. hdfs-347-merge.txt
          357 kB
          Todd Lipcon
        9. hdfs-347-merge.txt
          349 kB
          Todd Lipcon
        10. hdfs-347-merge.txt
          346 kB
          Todd Lipcon
        11. HDFS-347.035.patch
          343 kB
          Colin McCabe
        12. HDFS-347.033.patch
          321 kB
          Colin McCabe
        13. HDFS-347.030.patch
          260 kB
          Colin McCabe
        14. HDFS-347.029.patch
          259 kB
          Colin McCabe
        15. HDFS-347.027.patch
          261 kB
          Colin McCabe
        16. HDFS-347.026.patch
          240 kB
          Colin McCabe
        17. HDFS-347.025.patch
          240 kB
          Colin McCabe
        18. HDFS-347.024.patch
          240 kB
          Colin McCabe
        19. HDFS-347.022.patch
          249 kB
          Colin McCabe
        20. HDFS-347.021.patch
          248 kB
          Colin McCabe
        21. HDFS-347.020.patch
          245 kB
          Colin McCabe
        22. HDFS-347.019.patch
          247 kB
          Colin McCabe
        23. HDFS-347.018.patch2
          246 kB
          Colin McCabe
        24. HDFS-347.018.clean.patch
          112 kB
          Colin McCabe
        25. HDFS-347.017.patch
          247 kB
          Colin McCabe
        26. HDFS-347.017.clean.patch
          113 kB
          Colin McCabe
        27. HDFS-347.016.patch
          250 kB
          Colin McCabe
        28. HDFS-347-016_cleaned.patch
          109 kB
          Colin McCabe
        29. HDFS-347-branch-20-append.txt
          27 kB
          ryan rawson
        30. BlockReaderLocal1.txt
          28 kB
          Dhruba Borthakur
        31. hdfs-347.png
          16 kB
          Todd Lipcon
        32. all.tsv
          12 kB
          Todd Lipcon
        33. hdfs-347.txt
          26 kB
          Todd Lipcon
        34. local-reads-doc
          14 kB
          Todd Lipcon
        35. HADOOP-4801.3.patch
          15 kB
          George Porter
        36. HADOOP-4801.2.patch
          15 kB
          George Porter
        37. HADOOP-4801.1.patch
          14 kB
          George Porter

        Issue Links

          1.
          Encapsulate arguments to BlockReaderFactory in a class Sub-task Resolved Colin McCabe
          2.
          Encapsulate connections to peers in Peer and PeerServer classes Sub-task Resolved Colin McCabe
          3.
          Create DomainSocket and DomainPeer and associated unit tests Sub-task Resolved Colin McCabe
          4.
          BlockReaderLocal should use passed file descriptors rather than paths Sub-task Resolved Colin McCabe
          5.
          DomainSocket should throw AsynchronousCloseException when appropriate Sub-task Resolved Colin McCabe
          6.
          Bypass UNIX domain socket unit tests when they cannot be run Sub-task Resolved Colin McCabe
          7.
          DFSInputStream#getBlockReader: last retries should ignore the cache Sub-task Resolved Colin McCabe
          8.
          Fix bug in DomainSocket path validation Sub-task Resolved Colin McCabe
          9.
          some small DomainSocket fixes: avoid findbugs warning, change log level, etc. Sub-task Resolved Colin McCabe
          10.
          change dfs.datanode.domain.socket.path to dfs.domain.socket.path Sub-task Resolved Colin McCabe
          11.
          HDFS-347: fix case where local reads get disabled incorrectly Sub-task Resolved Colin McCabe
          12.
          HDFS-347: increase default FileInputStreamCache size Sub-task Resolved Todd Lipcon
          13.
          make TestPeerCache not flaky Sub-task Resolved Colin McCabe
          14.
          TestDomainSocket fails when system umask is set to 0002 Sub-task Resolved Colin McCabe
          15.
          avoid annoying log message when dfs.domain.socket.path is not set Sub-task Resolved Colin McCabe
          16.
          Make a simple doc to describe the usage and design of the shortcircuit read feature Sub-task Resolved Colin McCabe
          17.
          DataNode: don't create domain socket unless we need it Sub-task Resolved Colin McCabe
          18.
          HDFS-347: style cleanups Sub-task Resolved Colin McCabe
          19.
          HDFS-347: DN should chmod socket path a+w Sub-task Closed Colin McCabe
          20.
          DFSClient: don't create a domain socket unless we need it Sub-task Resolved Colin McCabe
          21.
          allow use of legacy blockreader Sub-task Resolved Colin McCabe
          22.
          fix various bugs in short circuit read Sub-task Closed Colin McCabe

          Activity

            People

              cmccabe Colin McCabe
              gmporter George Porter
              Votes:
              12 Vote for this issue
              Watchers:
              111 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: