  1. Hadoop Common
  2. HADOOP-18072 Über-JIRA: abfs phase III: Hadoop 3.4.0 features & fixes
  3. HADOOP-18971

ABFS: Enable Footer Read Optimizations with Appropriate Footer Read Buffer Size


Details

    • Reviewed

    Description

      Footer Read Optimization was introduced to hadoop-azure in HADOOP-17347 (https://issues.apache.org/jira/browse/HADOOP-17347)
      and was kept disabled by default.
      This PR enables footer reads by default, based on the results of the analysis described below:

      In our scale workload analysis, we found that workloads working with Parquet (and similar formats such as ORC) issue a lot of footer reads. A footer read here is a read operation the workload performs to get the metadata of the Parquet file, which it needs in order to locate the actual data within the file.
      This whole process takes place in 3 steps:

      1. The workload reads the last 8 bytes of the Parquet file to get the offset and size of the metadata, which sits immediately before these 8 bytes.
      2. Using that offset, the workload reads the metadata to get the exact offset and length of the data it wants to read.
      3. The workload performs the final read operation to get the data it actually needs.
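
The three steps above can be sketched on a toy in-memory file. This is a minimal illustration, not the code of any real Parquet reader: `build_toy_file`, `read_range`, and `read_data` are hypothetical helpers, and JSON stands in for the Thrift-encoded metadata a real Parquet file carries. The one detail taken from the Parquet format itself is the 8-byte tail: a 4-byte little-endian metadata length followed by the magic `PAR1`.

```python
import json
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with this 4-byte magic


def build_toy_file(rows: bytes) -> bytes:
    """Lay out magic, data, metadata, metadata length, magic."""
    # Assumption: JSON replaces the Thrift metadata of a real Parquet file.
    metadata = json.dumps(
        {"data_offset": len(MAGIC), "data_length": len(rows)}
    ).encode()
    return MAGIC + rows + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Stand-in for one read request against the store."""
    return blob[offset:offset + length]


def read_data(blob: bytes) -> bytes:
    size = len(blob)
    # Step 1: the last 8 bytes give the metadata length plus the magic.
    tail = read_range(blob, size - 8, 8)
    (meta_len,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet-style file")
    # Step 2: the metadata sits immediately before those 8 bytes.
    meta = json.loads(read_range(blob, size - 8 - meta_len, meta_len))
    # Step 3: read the data that the metadata points to.
    return read_range(blob, meta["data_offset"], meta["data_length"])
```

Steps 1 and 2 both touch only the tail of the file, which is why they are candidates for being served by one read.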

      The first two steps are metadata reads that can be combined into a single footer read. When the workload reads the last few bytes of a file (call this range the footer size), the driver also reads some extra bytes before the footer, so that the next read, which is expected to follow, can be served without another request.

      Q. What is the footer size of a file?
      A. 16KB. Any read request for data within the last 16KB of the file qualifies for a whole footer read. This value is enough to cater to all types of files, including Parquet, ORC, etc.
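
A minimal sketch of that qualification rule, assuming it is a simple comparison of the read offset against the start of the last 16KB. The function name is hypothetical and the driver's actual check may differ in detail.

```python
FOOTER_SIZE = 16 * 1024  # 16KB, as stated in the answer above


def qualifies_for_footer_read(read_offset: int, file_size: int) -> bool:
    """True when the read starts inside the last FOOTER_SIZE bytes of the file."""
    # max(0, ...) covers files smaller than 16KB, where every read qualifies.
    return read_offset >= max(0, file_size - FOOTER_SIZE)
```

For a 1MB file, a read of the last 8 bytes qualifies, while a read starting 17KB before the end does not.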

      Q. What is the buffer size to read when reading the footer?
      A. Let's call this the footer read buffer size. Prior to this PR, the footer read buffer size was the same as the read buffer size (default 4MB). For most workloads, the required footer size was found to be only 256KB; that is, for almost all Parquet files the metadata lies within the last 256KB of the file. Given this, it does not make sense to read a whole 4MB buffer as part of a footer read. Moreover, reading more data than required incurs additional cost in server and network latency. Based on this and extensive experimentation, a footer read buffer size of 512KB was observed to be ideal for almost all workloads running on Parquet, ORC, etc.

      The following configuration was introduced to set the footer read buffer size:
      fs.azure.footer.read.request.size: default 512 KB.
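
The effect of the footer read buffer size can be illustrated with a toy model. This is a sketch under stated assumptions, not the ABFS driver's implementation: `ToyStream` and `footer_access` are hypothetical, the stream keeps a single buffer, and reads outside the footer are passed through unbuffered. The model counts remote requests and bytes fetched for the two metadata reads (the last 8 bytes, then the metadata before them).

```python
from typing import Optional, Tuple

KB = 1024
FOOTER_SIZE = 16 * KB  # reads starting in the last 16KB qualify


class ToyStream:
    """Hypothetical single-buffer stream; not the ABFS driver's code."""

    def __init__(self, blob: bytes, footer_read_buffer_size: Optional[int]):
        self.blob = blob
        # None disables the footer read optimization in this model.
        self.footer_read_buffer_size = footer_read_buffer_size
        self.remote_requests = 0
        self.bytes_fetched = 0
        self.buf_start = 0
        self.buf = b""

    def _remote_read(self, offset: int, length: int) -> bytes:
        self.remote_requests += 1
        chunk = self.blob[offset:offset + length]
        self.bytes_fetched += len(chunk)
        return chunk

    def read(self, offset: int, length: int) -> bytes:
        # Serve from the buffer when the request lies fully inside it.
        if self.buf_start <= offset and offset + length <= self.buf_start + len(self.buf):
            start = offset - self.buf_start
            return self.buf[start:start + length]
        size = len(self.blob)
        if self.footer_read_buffer_size is not None and offset >= max(0, size - FOOTER_SIZE):
            # Footer read: fetch the last footer_read_buffer_size bytes at once.
            self.buf_start = max(0, size - self.footer_read_buffer_size)
            self.buf = self._remote_read(self.buf_start, size - self.buf_start)
            start = offset - self.buf_start
            return self.buf[start:start + length]
        return self._remote_read(offset, length)


def footer_access(stream: ToyStream, metadata_len: int) -> Tuple[int, int]:
    """Perform the two metadata reads; return (requests, bytes fetched)."""
    size = len(stream.blob)
    stream.read(size - 8, 8)
    stream.read(size - 8 - metadata_len, metadata_len)
    return stream.remote_requests, stream.bytes_fetched
```

In this model, with an 8MB file and 200KB of metadata, disabling the optimization costs two requests, while both a 4MB and a 512KB footer read buffer need one. The 512KB setting fetches one eighth of the bytes that the 4MB setting does, which is the trade-off the answer above describes.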

      Quantitative stats: for a workload running on Parquet files, the number of read requests dropped by 2.3M, from 20M. That is a reduction of around 10% in overall TPS.
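
As a quick check of the figure above, here is the arithmetic, assuming the numbers mean 2.3M fewer requests out of an original 20M (an interpretation of the wording, not an additional measurement). The variable and function names are illustrative.

```python
requests_before = 20_000_000
requests_saved = 2_300_000


def reduction_percent(before: int, saved: int) -> float:
    """Share of requests eliminated, as a percentage of the original count."""
    return 100.0 * saved / before


requests_after = requests_before - requests_saved
print(requests_after, f"{reduction_percent(requests_before, requests_saved):.1f}%")
```

Under that reading the reduction works out to a little over 10%, consistent with the "around 10%" stated above.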

            People

              anujmodi Anuj Modi