Hadoop HDFS
HDFS-3051

A zero-copy ScatterGatherRead api from FSDataInputStream

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs-client, performance
    • Labels:
      None

      Description

      It would be nice to have a new API in FSDataInputStream that allows zero-copy reads for HDFS readers.

        Issue Links

          Activity

          dhruba borthakur added a comment - edited

          The new API can be a new method in FSDataInputStream.

          public List<ByteBuffer> readFullyScatterGather(long position, int length)
              throws IOException {
            return ((PositionedReadable) in).readFullyScatterGather(position, length);
          }

          It will allow HDFS to return mapped byte buffers or direct byte buffers.
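
          To illustrate how a caller might consume such a multi-buffer result, here is a minimal sketch. The `collect` helper and demo class are hypothetical, not part of any Hadoop API; only the `List<ByteBuffer>` return shape follows the proposal above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

public class ScatterGatherDemo {
    // Hypothetical helper: drain a list of ByteBuffers (which could be
    // heap, direct, or mapped) into one byte[] of the given total length.
    static byte[] collect(List<ByteBuffer> buffers, int length) {
        byte[] out = new byte[length];
        int off = 0;
        for (ByteBuffer b : buffers) {
            int n = b.remaining();
            b.get(out, off, n);  // relative bulk get works for all buffer kinds
            off += n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulate a scatter-gather read that returned two buffers.
        List<ByteBuffer> result = Arrays.asList(
            ByteBuffer.wrap("hello ".getBytes()),
            ByteBuffer.wrap("world".getBytes()));
        System.out.println(new String(collect(result, 11)));  // prints "hello world"
    }
}
```

          The point of returning a buffer list rather than a byte[] is that the client only copies when it chooses to; a consumer that can process buffers in place never pays for the copy above.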

          Todd Lipcon added a comment -

          One interesting thing to take note of: in Linux prior to 2.6.37, the page fault handler for file mappings actually held the mmap semaphore exclusively, preventing other threads from modifying page mappings (or starting threads). So doing mmapped IO may have some downsides as well, especially on older kernels. Not sure if this issue is addressed in RHEL 6 or not. The Linux git hash is d065bd810b6deb67d4897a14bfe21f8eb526ba99, see also http://help.lockergnome.com/linux/PATCH-V2-Reduce-mmap_sem-hold-times-file-backed-page-faults--ftopict527005.html

          Colin Patrick McCabe added a comment -

          Hi Dhruba,

          This sounds interesting. One thing I don't completely understand about your proposed API is whether you will have multiple (position, length) pairs as inputs. Traditionally, scatter-gather implies being able to read multiple locations at once, as in preadv(2). However, I only see a single (position, length) pair in your readFullyScatterGather function.

          Also, it seems to me that by mmapping at a fixed address, you could create a single contiguous buffer rather than forcing the user to deal with multiple buffers for a single HDFS file.

          C.
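
          A preadv(2)-style extension along the lines Colin describes might look like the sketch below. The `ReadRange` tuple and the free-standing method are hypothetical illustrations (here simulated over an in-memory byte[]), not an actual Hadoop interface.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreadvSketch {
    // Hypothetical request tuple, analogous to one iovec-like entry.
    static final class ReadRange {
        final long position;
        final int length;
        ReadRange(long position, int length) {
            this.position = position;
            this.length = length;
        }
    }

    // Hypothetical multi-range variant: one or more buffers per requested
    // range, mirroring preadv(2)'s multiple (position, length) inputs.
    // A real implementation could return mapped or direct buffers; this
    // sketch slices a heap array without copying.
    static List<ByteBuffer> readFullyScatterGather(byte[] file, List<ReadRange> ranges) {
        List<ByteBuffer> out = new ArrayList<>();
        for (ReadRange r : ranges) {
            out.add(ByteBuffer.wrap(file, (int) r.position, r.length).slice());
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "abcdefghij".getBytes();
        List<ByteBuffer> parts = readFullyScatterGather(data,
            Arrays.asList(new ReadRange(0, 3), new ReadRange(7, 3)));
        for (ByteBuffer b : parts) {
            byte[] tmp = new byte[b.remaining()];
            b.get(tmp);
            System.out.println(new String(tmp));  // prints "abc" then "hij"
        }
    }
}
```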

          Colin Patrick McCabe added a comment -

          Todd said:

          One interesting thing to take note of: in Linux prior to 2.6.37, the page fault handler for file mappings actually held the mmap semaphore exclusively, preventing other threads from modifying page mappings (or starting threads). So doing mmapped IO may have some downsides as well, especially on older kernels. Not sure if this issue is addressed in RHEL 6 or not. The Linux git hash is d065bd810b6deb67d4897a14bfe21f8eb526ba99, see also http://help.lockergnome.com/linux/PATCH-V2-Reduce-mmap_sem-hold-times-file-backed-page-faults--ftopict527005.html

          Good point.

          At least in theory, you can create threads on Linux without calling mmap. You just can't create pthreads (note the "p"). I wonder what HotSpot does exactly to create threads?

          dhruba borthakur added a comment -

          Hi Colin, I agree that scatter/gather typically refers to multiple input tuples of (position, length). Yes, we can extend the API to include that.

          The reason my original proposal did not include that was because I was mostly targeting this API at reducing the number of buffer copies.

          Tim Broberg added a comment -

          This interface adds some complexity to the ZeroCopyCompressor interface, HADOOP-8148. Debugging traversal of a list of objects across JNI is likely to take some work.

          Are we approaching any kind of consensus on whether to incorporate this or not?

          Also, how large are the individual buffers in these lists, typically?


            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              0
              Watchers:
              25
