  1. Parquet
  2. PARQUET-1381

Add merge blocks command to parquet-tools

Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: parquet-mr

    Description

      The current implementation of the merge command in parquet-tools doesn't merge row groups; it just places them one after another. Add an API and a command option to merge small blocks into larger ones, up to a specified size limit.

      Implementation details:

      Blocks are not reordered, so as not to break possible initial predicate pushdown optimizations.
      Blocks are not split to fit the upper bound exactly.
      This is an intentional performance optimization:
      it allows new blocks to be formed by copying the full content of smaller blocks column by column, rather than row by row.

      Examples:
      1. Input files with block sizes:
        [128 | 35], [128 | 40], [120]

        Expected output file block sizes:

        merge
        [128 | 35 | 128 | 40 | 120]

        merge -b
        [128 | 35 | 128 | 40 | 120]

        merge -b -l 256
        [163 | 168 | 120]

      2. Input files with block sizes:
        [128 | 35], [40], [120], [6]

        Expected output file block sizes:

        merge
        [128 | 35 | 40 | 120 | 6]

        merge -b
        [128 | 75 | 126]

        merge -b -l 256
        [203 | 126]
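
      The block sizes in the examples above are consistent with a simple greedy pass over the blocks in their original order. The following Python sketch only models the size arithmetic and is not the parquet-mr implementation. The function name `merge_block_sizes` is hypothetical, and the default limit of 128 is an assumption inferred from the `merge -b` results above, not something the ticket states.

      ```python
      from typing import List

      # Assumed default limit, inferred from the `merge -b` examples in this ticket.
      DEFAULT_LIMIT = 128


      def merge_block_sizes(files: List[List[int]], limit: int = DEFAULT_LIMIT) -> List[int]:
          """Greedily combine consecutive block sizes, never reordering or splitting."""
          merged: List[int] = []
          current = 0  # size of the output block being built; 0 means it is empty
          for blocks in files:
              for size in blocks:
                  # Close the current output block if adding this one would exceed the limit.
                  if current and current + size > limit:
                      merged.append(current)
                      current = 0
                  current += size
          if current:
              merged.append(current)
          return merged


      if __name__ == "__main__":
          example_1 = [[128, 35], [128, 40], [120]]
          example_2 = [[128, 35], [40], [120], [6]]
          print(merge_block_sizes(example_1))        # models `merge -b`
          print(merge_block_sizes(example_1, 256))   # models `merge -b -l 256`
          print(merge_block_sizes(example_2))
          print(merge_block_sizes(example_2, 256))
      ```

      Because whole blocks are appended as they are, a block that is already larger than the limit stays on its own, and the last output block may fall well short of the limit.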

          People

            Assignee: Katya Ekaterina Galieva
            Reporter: Katya Ekaterina Galieva