  1. Hadoop HDFS
  2. HDFS-15692

Improve fuse_dfs read performance

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fuse-dfs
    • Labels: None

    Description

      Currently fuse_dfs uses a prefetch buffer to read from HDFS via libhdfs' pread method.

      In short, the algorithm in fuse_read.c does the following:
      if the rdbuffer size is less than the size of the buffer provided
      then
        read directly into the supplied buffer
      else
        grab lock
          if the prefetch buffer does not have more data
          then
            fill the prefetch buffer
          endif
          fill the supplied buffer via memcpy from the prefetch buffer
        release lock
      endif
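
The steps above can be sketched in C roughly as follows. This is a minimal illustration and not the actual fuse_read.c code: prefetch_file, prefetch_init, prefetch_read and the pread_fn callback are hypothetical names, and the callback stands in for libhdfs' positioned read so that the sketch compiles without a cluster. The in-memory mem_src backend only exists to exercise the logic.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical positioned-read callback standing in for libhdfs' pread:
 * reads up to len bytes at offset off; returns bytes read, 0 at EOF, -1 on error. */
typedef ssize_t (*pread_fn)(void *ctx, off_t off, void *dst, size_t len);

typedef struct {
    pthread_mutex_t lock;
    char    *buf;      /* prefetch buffer */
    size_t   cap;      /* its capacity (the "rdbuffer size") */
    off_t    start;    /* file offset of buf[0] */
    size_t   filled;   /* number of valid bytes in buf */
    pread_fn backend;
    void    *ctx;
} prefetch_file;

void prefetch_init(prefetch_file *f, char *buf, size_t cap, pread_fn backend, void *ctx)
{
    pthread_mutex_init(&f->lock, NULL);
    f->buf = buf; f->cap = cap; f->start = 0; f->filled = 0;
    f->backend = backend; f->ctx = ctx;
}

ssize_t prefetch_read(prefetch_file *f, char *dst, size_t size, off_t off)
{
    assert(f && dst && off >= 0);
    if (f->cap < size)                              /* rdbuffer smaller than the request: */
        return f->backend(f->ctx, off, dst, size);  /* read directly into the caller's buffer */

    ssize_t ret = -1;
    int failed = 0;
    pthread_mutex_lock(&f->lock);
    off_t end = f->start + (off_t)f->filled;
    if (off < f->start || off + (off_t)size > end) {   /* buffer does not cover the request */
        ssize_t n = f->backend(f->ctx, off, f->buf, f->cap);
        f->start = off;
        f->filled = n > 0 ? (size_t)n : 0;
        failed = n < 0;
    }
    if (!failed) {
        size_t avail = (size_t)(f->start + (off_t)f->filled - off);
        size_t n = size < avail ? size : avail;
        memcpy(dst, f->buf + (off - f->start), n);     /* serve the read from the prefetch buffer */
        ret = (ssize_t)n;
    }
    pthread_mutex_unlock(&f->lock);
    return ret;
}

/* In-memory backend, only for trying the sketch out; calls counts backend reads. */
typedef struct { const char *data; size_t len; int calls; } mem_src;

ssize_t mem_pread(void *ctx, off_t off, void *dst, size_t len)
{
    mem_src *s = (mem_src *)ctx;
    s->calls++;
    if (off < 0) return -1;
    if ((size_t)off >= s->len) return 0;
    size_t n = s->len - (size_t)off;
    if (n > len) n = len;
    memcpy(dst, s->data + off, n);
    return (ssize_t)n;
}
```

Note that every refill happens while the lock is held, so a reader whose request misses the buffer waits for a full backend round trip. That blocking refill is the part a background fetcher could take off the read path.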

      It would be nice to have a background thread and double prefetch buffers: while one buffer serves the reads coming from the local client, the other can prefetch the next chunk of data. That should improve read speed, especially with EC-encoded files.

      According to some measurements I did, increasing the read buffer changes the runtime significantly: with 64MB the runtime gets much closer to that of the HDFS client. Interestingly, 128MB as the buffer size does not perform well, but 256MB gets even closer to what the dfs client can provide (16 vs 18 seconds with rep3 files, and on par with EC-encoded files, dfs vs fuse).
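
A comparison like this can be repeated by timing a full sequential read of the same file through the fuse mount while varying the mount's read buffer size. The helper below is only a sketch of such a harness under that assumption: time_sequential_read is a hypothetical name, it uses plain POSIX open/read, and its chunk parameter is the size of each application read request, not the fuse_dfs buffer itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Reads the whole file at path sequentially in requests of chunk bytes.
 * Stores the number of bytes read in *total and returns the elapsed
 * wall-clock seconds, or -1.0 on error. */
double time_sequential_read(const char *path, size_t chunk, long long *total)
{
    assert(path != NULL && chunk > 0 && total != NULL);
    *total = 0;

    char *buf = (char *)malloc(chunk);
    if (buf == NULL)
        return -1.0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        *total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    if (n < 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

For comparable numbers, each run should read the same file and start from the same cache state, since the local page cache can otherwise hide the cost of the remote reads.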

      So it seems worthwhile to continuously stream larger chunks of data, at least with pread. With a separate fetching thread and double buffering, we would not even need positioned reads, just continuous streaming of data with read.
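
One possible shape for the double-buffering idea described above is sketched here. It is an assumption-laden illustration, not a proposed patch: dbuf_reader, dbuf_start, dbuf_read and dbuf_stop are hypothetical names, and POSIX read on a file descriptor stands in for a streaming HDFS read so the sketch can run without libhdfs. A background thread fills one buffer while the consumer drains the other.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct { char *data; size_t len; int full; } dbuf_slot;

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    pthread_t       fetcher;
    dbuf_slot       slot[2];
    size_t          cap;        /* capacity of each buffer */
    size_t          drain_pos;  /* bytes already consumed from the draining slot */
    int             drain_idx;  /* slot the consumer is reading from */
    int             done;       /* fetcher reached EOF or hit an error */
    int             stop;       /* shutdown requested */
    int             fd;         /* source; read(2) stands in for a streaming HDFS read */
} dbuf_reader;

/* Background thread: fills the two slots alternately, reading without the lock held. */
static void *dbuf_fetch(void *arg)
{
    dbuf_reader *r = (dbuf_reader *)arg;
    for (int i = 0;; i ^= 1) {
        pthread_mutex_lock(&r->mu);
        while (r->slot[i].full && !r->stop)      /* wait until slot i has been drained */
            pthread_cond_wait(&r->cv, &r->mu);
        int stop = r->stop;
        pthread_mutex_unlock(&r->mu);
        if (stop) break;

        ssize_t n = read(r->fd, r->slot[i].data, r->cap);

        pthread_mutex_lock(&r->mu);
        if (n > 0) { r->slot[i].len = (size_t)n; r->slot[i].full = 1; }
        else r->done = 1;                        /* this sketch treats errors like EOF */
        pthread_cond_broadcast(&r->cv);
        pthread_mutex_unlock(&r->mu);
        if (n <= 0) break;
    }
    return NULL;
}

int dbuf_start(dbuf_reader *r, int fd, size_t cap)
{
    assert(r != NULL && cap > 0);
    memset(r, 0, sizeof *r);
    r->fd = fd; r->cap = cap;
    r->slot[0].data = (char *)malloc(2 * cap);   /* one allocation, split into two halves */
    if (r->slot[0].data == NULL) return -1;
    r->slot[1].data = r->slot[0].data + cap;
    pthread_mutex_init(&r->mu, NULL);
    pthread_cond_init(&r->cv, NULL);
    if (pthread_create(&r->fetcher, NULL, dbuf_fetch, r) == 0) return 0;
    pthread_cond_destroy(&r->cv); pthread_mutex_destroy(&r->mu);
    free(r->slot[0].data);
    return -1;
}

/* Consumer side: copies up to size bytes; returns fewer only at end of stream. */
ssize_t dbuf_read(dbuf_reader *r, char *dst, size_t size)
{
    size_t got = 0;
    pthread_mutex_lock(&r->mu);
    while (got < size) {
        dbuf_slot *s = &r->slot[r->drain_idx];
        while (!s->full && !r->done)
            pthread_cond_wait(&r->cv, &r->mu);
        if (!s->full) break;                     /* stream ended and nothing is left */
        size_t n = s->len - r->drain_pos;
        if (n > size - got) n = size - got;
        memcpy(dst + got, s->data + r->drain_pos, n);
        got += n; r->drain_pos += n;
        if (r->drain_pos == s->len) {            /* slot drained: hand it back to the fetcher */
            s->full = 0; r->drain_pos = 0; r->drain_idx ^= 1;
            pthread_cond_broadcast(&r->cv);
        }
    }
    pthread_mutex_unlock(&r->mu);
    return (ssize_t)got;
}

void dbuf_stop(dbuf_reader *r)
{
    pthread_mutex_lock(&r->mu);
    r->stop = 1;
    pthread_cond_broadcast(&r->cv);
    pthread_mutex_unlock(&r->mu);
    pthread_join(r->fetcher, NULL);
    pthread_cond_destroy(&r->cv); pthread_mutex_destroy(&r->mu);
    free(r->slot[0].data);
}
```

The sketch only handles strictly sequential consumption. A FUSE read handler receives an explicit offset, so a real implementation would presumably also need a path for non-sequential requests, for example falling back to a positioned read or restarting the stream at the new offset.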

          People

            Assignee: Unassigned
            Reporter: István Fajth (pifta)