Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-347

DFS read performance suboptimal when client co-located on nodes with data

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Reviewed

    Description

      One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.

      After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process (running in a separate JVM) running on the same machine. The DataNode will satisfy the single disk block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write(). In either case, each chunk is read from the disk by issuing a separahitting te I/O operation for each chunk. The result is that the single request for a 64-MB block ends up the disk as over a thousand smaller requests for 64-KB each.

      Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, the 64-kb chunk turns into a large number of 1500 byte packet send operations). Thus we see a large number of context switches for each block send operation.

      I'd like to get some feedback on the best way to address this, but I think providing a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode, marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same hadoop configuration, the DFSClient should be able to find the files holding the block data, and it could directly open them and send data back to the client. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64KB, which should reduce the number of iops imposed by each read block operation.

      Attachments

        1. 2013.01.28.design.pdf
          72 kB
          Colin McCabe
        2. 2013.01.31.consolidated.patch
          402 kB
          Colin McCabe
        3. 2013.01.31.consolidated2.patch
          402 kB
          Colin McCabe
        4. 2013.02.15.consolidated4.patch
          366 kB
          Colin McCabe
        5. 2013-04-01-jenkins.patch
          424 kB
          Colin McCabe
        6. a.patch
          413 kB
          Tsz-wo Sze
        7. all.tsv
          12 kB
          Todd Lipcon
        8. BlockReaderLocal1.txt
          28 kB
          Dhruba Borthakur
        9. full.patch
          379 kB
          Colin McCabe
        10. HADOOP-4801.1.patch
          14 kB
          George Porter
        11. HADOOP-4801.2.patch
          15 kB
          George Porter
        12. HADOOP-4801.3.patch
          15 kB
          George Porter
        13. HDFS-347.016.patch
          250 kB
          Colin McCabe
        14. HDFS-347.017.clean.patch
          113 kB
          Colin McCabe
        15. HDFS-347.017.patch
          247 kB
          Colin McCabe
        16. HDFS-347.018.clean.patch
          112 kB
          Colin McCabe
        17. HDFS-347.018.patch2
          246 kB
          Colin McCabe
        18. HDFS-347.019.patch
          247 kB
          Colin McCabe
        19. HDFS-347.020.patch
          245 kB
          Colin McCabe
        20. HDFS-347.021.patch
          248 kB
          Colin McCabe
        21. HDFS-347.022.patch
          249 kB
          Colin McCabe
        22. HDFS-347.024.patch
          240 kB
          Colin McCabe
        23. HDFS-347.025.patch
          240 kB
          Colin McCabe
        24. HDFS-347.026.patch
          240 kB
          Colin McCabe
        25. HDFS-347.027.patch
          261 kB
          Colin McCabe
        26. HDFS-347.029.patch
          259 kB
          Colin McCabe
        27. HDFS-347.030.patch
          260 kB
          Colin McCabe
        28. HDFS-347.033.patch
          321 kB
          Colin McCabe
        29. HDFS-347.035.patch
          343 kB
          Colin McCabe
        30. hdfs-347.png
          16 kB
          Todd Lipcon
        31. hdfs-347.txt
          26 kB
          Todd Lipcon
        32. HDFS-347-016_cleaned.patch
          109 kB
          Colin McCabe
        33. HDFS-347-branch-20-append.txt
          27 kB
          ryan rawson
        34. hdfs-347-merge.txt
          357 kB
          Todd Lipcon
        35. hdfs-347-merge.txt
          349 kB
          Todd Lipcon
        36. hdfs-347-merge.txt
          346 kB
          Todd Lipcon
        37. local-reads-doc
          14 kB
          Todd Lipcon

        Issue Links

        1.
        Encapsulate arguments to BlockReaderFactory in a class Sub-task Resolved Colin McCabe Actions
        2.
        Encapsulate connections to peers in Peer and PeerServer classes Sub-task Resolved Colin McCabe Actions
        3.
        Create DomainSocket and DomainPeer and associated unit tests Sub-task Resolved Colin McCabe Actions
        4.
        BlockReaderLocal should use passed file descriptors rather than paths Sub-task Resolved Colin McCabe Actions
        5.
        DomainSocket should throw AsynchronousCloseException when appropriate Sub-task Resolved Colin McCabe Actions
        6.
        Bypass UNIX domain socket unit tests when they cannot be run Sub-task Resolved Colin McCabe Actions
        7.
        DFSInputStream#getBlockReader: last retries should ignore the cache Sub-task Resolved Colin McCabe Actions
        8.
        Fix bug in DomainSocket path validation Sub-task Resolved Colin McCabe Actions
        9.
        some small DomainSocket fixes: avoid findbugs warning, change log level, etc. Sub-task Resolved Colin McCabe Actions
        10.
        change dfs.datanode.domain.socket.path to dfs.domain.socket.path Sub-task Resolved Colin McCabe Actions
        11.
        HDFS-347: fix case where local reads get disabled incorrectly Sub-task Resolved Colin McCabe Actions
        12.
        HDFS-347: increase default FileInputStreamCache size Sub-task Resolved Todd Lipcon Actions
        13.
        make TestPeerCache not flaky Sub-task Resolved Colin McCabe Actions
        14.
        TestDomainSocket fails when system umask is set to 0002 Sub-task Resolved Colin McCabe Actions
        15.
        avoid annoying log message when dfs.domain.socket.path is not set Sub-task Resolved Colin McCabe Actions
        16.
        Make a simple doc to describe the usage and design of the shortcircuit read feature Sub-task Resolved Colin McCabe Actions
        17.
        DataNode: don't create domain socket unless we need it Sub-task Resolved Colin McCabe Actions
        18.
        HDFS-347: style cleanups Sub-task Resolved Colin McCabe Actions
        19.
        HDFS-347: DN should chmod socket path a+w Sub-task Closed Colin McCabe Actions
        20.
        DFSClient: don't create a domain socket unless we need it Sub-task Resolved Colin McCabe Actions
        21.
        allow use of legacy blockreader Sub-task Resolved Colin McCabe Actions
        22.
        fix various bugs in short circuit read Sub-task Closed Colin McCabe Actions

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            cmccabe Colin McCabe
            gmporter George Porter
            Votes:
            12 Vote for this issue
            Watchers:
            111 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment