Parquet / PARQUET-306

Improve alignment between row groups and HDFS blocks

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: parquet-mr
    • Labels:
      None

      Description

      Row groups should not span HDFS blocks: a task reading a row group that crosses a block boundary may have to fetch part of it from a remote node. There are three ways to avoid this:
      1. Cap the next row group's size at the number of bytes remaining in the current HDFS block
      2. Use HDFS-3689, variable-length HDFS blocks, when available
      3. When a row group ends close to a block boundary, pad the rest of the block so that the next row group starts at the beginning of the next block
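Options 1 and 3 come down to simple arithmetic on the current write position, and could be sketched roughly as follows. This is a minimal illustration, not the parquet-mr implementation: the class name `RowGroupAlignment`, its method names, and the `maxPadding` threshold are all hypothetical, and the sketch assumes a fixed HDFS block size and a file that starts on a block boundary.

```java
// Hypothetical helper, not part of parquet-mr: illustrates options 1 and 3 only.
class RowGroupAlignment {

    /** Bytes left before the next block boundary, given the current write position. */
    static long remainingInBlock(long position, long blockSize) {
        // At an exact boundary this returns blockSize, i.e. a whole fresh block.
        return blockSize - (position % blockSize);
    }

    /**
     * Option 3: number of padding bytes to write before the next row group.
     * Pads only when the boundary is within maxPadding bytes (an assumed
     * threshold), so that little space is wasted.
     */
    static long paddingBefore(long position, long blockSize, long maxPadding) {
        long remaining = remainingInBlock(position, blockSize);
        if (remaining == blockSize || remaining > maxPadding) {
            return 0; // already aligned, or too far from the boundary to pad
        }
        return remaining;
    }

    /** Option 1: target size of the next row group, capped at what is left in the block. */
    static long nextRowGroupSize(long position, long blockSize, long rowGroupSize) {
        return Math.min(rowGroupSize, remainingInBlock(position, blockSize));
    }
}
```

With a 128 MB block and an 8 MB padding threshold, a writer positioned at 126 MB would pad 2 MB and start the next row group on the block boundary, while a writer at 100 MB would skip padding and instead target a 28 MB row group that fits in the rest of the block.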

              People

              • Assignee: Ryan Blue (rdblue)
              • Reporter: Ryan Blue (rdblue)
              • Votes: 0
              • Watchers: 2
