Hadoop HDFS / HDFS-1034

Enhance datanode to read data and checksum file in parallel

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      In the current HDFS implementation, a read of a block issued to the datanode results in a disk access to the checksum file followed by a disk access to the block (data) file. It would be nice to be able to do these two I/Os in parallel to reduce read latency.
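
      The idea can be sketched in plain Java as below. This is only an illustration of issuing the two reads concurrently, not the datanode's actual code path: the class name ParallelBlockRead, the Result holder, and the use of CompletableFuture on the common pool are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper, not part of HDFS: reads a block file and its
// checksum (meta) file concurrently instead of one after the other.
public class ParallelBlockRead {

    /** Holds the bytes read from the two files. */
    public static final class Result {
        public final byte[] data;
        public final byte[] checksums;

        Result(byte[] data, byte[] checksums) {
            this.data = data;
            this.checksums = checksums;
        }
    }

    // Wraps the checked IOException so the method can be used in a lambda.
    private static byte[] readAll(Path p) {
        try {
            return Files.readAllBytes(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static Result readInParallel(Path blockFile, Path metaFile) {
        // Start the checksum read on another thread...
        CompletableFuture<byte[]> sums =
                CompletableFuture.supplyAsync(() -> readAll(metaFile));
        // ...and read the block data on the calling thread in the meantime.
        byte[] data = readAll(blockFile);
        // join() waits for the checksum read and rethrows any failure.
        return new Result(data, sums.join());
    }
}
```

      Whether this reduces latency in practice depends on whether either read actually goes to disk, which is the question the comments below discuss.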

          Activity

          Jay Booth added a comment -

          Wouldn't this preclude the use of transferTo for transfer from the main block file? The packet header and sums need to be sent before transferTo is invoked, otherwise things would all be jumbled up together.
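
          To make the ordering constraint concrete, here is a minimal sketch (not taken from the HDFS source) of a sender that writes the header and checksums first and only then hands the block data to FileChannel.transferTo. The class name PacketSendOrder and the flat header/checksum byte arrays are assumptions for illustration; the real packet layout is not shown in this thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sender: header and checksums must be fully written
// before transferTo starts, or the bytes would interleave on the wire.
public class PacketSendOrder {

    public static void sendPacket(WritableByteChannel out, byte[] header,
                                  byte[] checksums, Path blockFile,
                                  long offset, long length) throws IOException {
        // 1. Header and checksums go out first, through user-space buffers.
        writeFully(out, ByteBuffer.wrap(header));
        writeFully(out, ByteBuffer.wrap(checksums));

        // 2. Only then is the block data handed to transferTo.
        try (FileChannel block = FileChannel.open(blockFile, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < length) {
                long n = block.transferTo(offset + sent, length - sent, out);
                if (n <= 0) {
                    break; // nothing more could be transferred (e.g. end of file)
                }
                sent += n;
            }
        }
    }

    // write() may accept fewer bytes than offered, so loop until drained.
    private static void writeFully(WritableByteChannel out, ByteBuffer buf)
            throws IOException {
        while (buf.hasRemaining()) {
            out.write(buf);
        }
    }
}
```

          In this shape the checksums have to be in memory before the data transfer begins, so a checksum read that is still in flight would hold up step 2.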

          Todd Lipcon added a comment -

          A bit of manual readahead on the checksum file could theoretically have the same benefit without the problem Jay mentioned. The checksum file is opened with a buffer, right? In that case, do we expect to see a big benefit?
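
          As a rough sketch of what buffered access to the checksum file could look like (an illustration only; the class name ChecksumReadahead and the readaheadBytes parameter are made up here and do not come from the datanode code):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: opens the checksum file behind a buffer so that
// the first read pulls in up to readaheadBytes, and later checksum reads
// are served from memory rather than issuing new reads on the file.
public class ChecksumReadahead {

    public static DataInputStream open(Path metaFile, int readaheadBytes)
            throws IOException {
        InputStream raw = Files.newInputStream(metaFile);
        // BufferedInputStream fills its internal buffer on the first read,
        // which acts as a simple form of readahead for a small file.
        return new DataInputStream(new BufferedInputStream(raw, readaheadBytes));
    }
}
```

          A caller would read 4-byte checksums with readInt() and close the stream when done. With a buffer at least as large as the checksum file, the whole file would typically be fetched by the first read.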

          Zlatin Balevsky added a comment -

          How complicated would it be to store the checksum file on a separate mount point? In JBOD configurations this will enable both reads to happen simultaneously.

          Todd Lipcon added a comment -

          The only scary thing about moving checksums to a different mount point is that the checksum file's metadata will be on a different journal than the data files. This might end up fine, but it's a little nerve-racking in terms of what kind of consistency we get out of the FS; it could cause some very subtle bugs.

          Have we seen these issues to be a significant bottleneck?

          Zlatin Balevsky added a comment -

          The only possible bottleneck is the extra disk seek, which may or may not be a big deal. It probably is for HBase-type workloads. There are many ways around that, including but not limited to:

          a) prepending a copy of the checksum file to the block file while keeping the separate copy intact for off-thread verification after the transfer starts
          b) using some ext4-extents JNI magic
          ... ?

          Todd Lipcon added a comment -

          In practice I don't imagine the extra disk seek for checksums is a problem for HBase. Since the checksum file is relatively small, my guess is that it stays hot in the Linux buffer cache and therefore doesn't represent any disk access. It would certainly be interesting to run blktrace on a heavily loaded HBase datanode to see if this is true, though!


            People

            • Assignee:
              dhruba borthakur
            • Reporter:
              dhruba borthakur
            • Votes:
              0
            • Watchers:
              16
