PARQUET-374

Add API to read the dictionary from each column chunk for predicate pushdown

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels:
      None

      Description

      A Parquet file's dictionaries can be used for predicate pushdown.
      For example, the SQL query:
      select * from table where column = 10;
      can skip reading an entire row group if the dictionary for column in that row group contains only the values [5, 11, 17, 20], since 10 cannot appear in any row.
      This saves I/O and improves performance.
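
      The skip decision described above can be sketched in a few lines of Java. This is only an illustration of the idea, not parquet-mr or Presto code: the class name DictionaryFilterSketch and the method canSkipRowGroup are hypothetical, and the dictionary is modeled as a plain Set<Integer> rather than a real Parquet dictionary page.

```java
import java.util.Set;

// Hypothetical sketch: not part of parquet-mr or Presto.
final class DictionaryFilterSketch {

    /**
     * Decides whether a row group can be skipped for the predicate "column = wanted".
     * A null dictionary means no usable dictionary is available, so the row group
     * must be read.
     */
    static boolean canSkipRowGroup(Set<Integer> dictionary, int wanted) {
        if (dictionary == null) {
            return false; // nothing is known about the values; cannot skip
        }
        // Every value in the row group comes from the dictionary, so a value
        // missing from the dictionary cannot occur in any row.
        return !dictionary.contains(wanted);
    }

    public static void main(String[] args) {
        Set<Integer> dictionary = Set.of(5, 11, 17, 20);
        System.out.println("skip for column = 10: " + canSkipRowGroup(dictionary, 10));
        System.out.println("skip for column = 11: " + canSkipRowGroup(dictionary, 11));
    }
}
```

      Only a definite miss allows a skip; a hit in the dictionary says the value may be present, so the row group still has to be read and filtered.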

      We implemented dictionary-based predicate pushdown for Parquet files in Presto, and our benchmarks show up to a 40x speedup for selective queries.

      We need to add an API to ParquetFileReader that returns the dictionaries for the requested columns.
      If a column is not dictionary encoded in the row group, return null.
      If not all of the column's pages are dictionary encoded in the row group, return null.
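
      The null-returning contract above might look like the following sketch. It does not use the real ParquetFileReader; ColumnChunk, PageEncoding, and readDictionary are hypothetical stand-ins chosen for illustration. The reason for the second rule is that a value stored in a non-dictionary page would not appear in the dictionary, so a partial dictionary cannot prove that a value is absent.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed contract; not the parquet-mr API.
final class ColumnChunkDictionarySketch {

    /** Simplified page encodings (assumption: only two kinds matter here). */
    enum PageEncoding { DICTIONARY, PLAIN }

    /** Stand-in for one column chunk inside a row group. */
    static final class ColumnChunk {
        final Set<Integer> dictionary;              // null if the chunk has no dictionary
        final List<PageEncoding> dataPageEncodings; // encoding of each data page

        ColumnChunk(Set<Integer> dictionary, List<PageEncoding> dataPageEncodings) {
            this.dictionary = dictionary;
            this.dataPageEncodings = dataPageEncodings;
        }
    }

    /** Returns the dictionary only when every data page is dictionary encoded. */
    static Set<Integer> readDictionary(ColumnChunk chunk) {
        if (chunk.dictionary == null) {
            return null; // column is not dictionary encoded in this row group
        }
        for (PageEncoding encoding : chunk.dataPageEncodings) {
            if (encoding != PageEncoding.DICTIONARY) {
                return null; // some pages hold values outside the dictionary
            }
        }
        return chunk.dictionary;
    }

    public static void main(String[] args) {
        ColumnChunk full = new ColumnChunk(Set.of(5, 11, 17, 20),
                List.of(PageEncoding.DICTIONARY, PageEncoding.DICTIONARY));
        ColumnChunk mixed = new ColumnChunk(Set.of(5, 11),
                List.of(PageEncoding.DICTIONARY, PageEncoding.PLAIN));
        System.out.println("fully encoded chunk has dictionary: " + (readDictionary(full) != null));
        System.out.println("mixed chunk has dictionary: " + (readDictionary(mixed) != null));
    }
}
```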

          Activity

          zhenxiao Zhenxiao Luo added a comment - https://github.com/apache/parquet-mr/pull/270
          rdblue Ryan Blue added a comment -

          I'm marking this as "Won't fix" because PARQUET-384 includes the proposed API for accessing dictionaries.


            People

            • Assignee:
              zhenxiao Zhenxiao Luo
              Reporter:
              zhenxiao Zhenxiao Luo
            • Votes:
              0
              Watchers:
              2
