Uploaded image for project: 'IMPALA'
  2. IMPALA-8525

preads should use hdfsPreadFully rather than hdfsPread



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • Impala 3.4.0
    • Backend
    • None
    • ghx-label-1


      Impala preads (only enabled if use_hdfs_pread is true) use the hdfsPread API from libhdfs, which ultimately invokes PositionedReadable#read(long position, byte[] buffer, int offset, int length) in the HDFS-client.

      PositionedReadable also exposes the method readFully(long position, byte[] buffer, int offset, int length). The difference is that #read will "Read up to the specified number of bytes" whereas #readFully will "Read the specified number of bytes". So there is no guarantee that #read will read all of the request bytes.

      Impala calls hdfsPread inside hdfs-file-reader.cc and invokes it inside a while loop until all the requested bytes have been read from the file. This can cause a few performance issues:

      (1) if the underlying FileSystem does not support ByteBuffer reads (HDFS-2834) (e.g. S3A does not support this feature) then hdfsPread will allocate a Java array equal in size to specified length of the buffer; the call to PositionedReadable#read may only fill up the buffer partially; Impala will repeat the call to hdfsPread since the buffer was not filled, which will cause another large array allocation; this can result in a lot of wasted time doing unnecessary array allocations

      (2) given that Impala calls hdfsPread in a while loop, there is no point in continuously calling hdfsPread when a single call to hdfsPreadFully will achieve the same thing (this doesn't actually affect performance much, but is unnecessary)

      Prior solutions to this problem have been to introduce a "chunk-size" to Impala reads (https://gerrit.cloudera.org/#/c/63/ - S3: DiskIoMgr related changes for S3). However, with the migration to hdfsPreadFully the chunk-size is no longer necessary.

      Furthermore, preads are most effective when the data is read all at once (e.g. in 8 MB chunks as specified by read_size) rather than in smaller chunks (typically 128K). For example, DFSInputStream#read(long position, byte[] buffer, int offset, int length) opens up remote block readers with a byte range determined by the value of length passed into the #read call. Similarly, S3AInputStream#readFully will issue an HTTP GET request with the size of the read specified by the given length (although fadvise must be set to RANDOM for this to work).

      This work is dependent on exposing readFully via libhdfs first: HDFS-14564


        Issue Links



              stakiar Sahil Takiar
              stakiar Sahil Takiar
              0 Vote for this issue
              5 Start watching this issue