Parquet / PARQUET-128

Optimize the Parquet RecordReader implementation when: A. a filter predicate is pushed down, B. a filter predicate is pushed down on a flat schema


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 1.6.0
    • Fix Version: 1.6.0
    • Component: parquet-mr
    • Labels: None

    Description

      The current RecordReader implementation reads all of the columns before applying the filter predicate and deciding whether to keep or discard the row.
      Instead, we can have a RecordReader that first assembles only the columns on which filters are applied (usually a few), applies the filter, and decides whether to keep the row; it then either assembles or skips the remaining columns accordingly.

      Also, for applications like Spark SQL, the schema is usually flat, with no repeated or nested columns. In such cases, it is better to have a lightweight, faster RecordReader.

      The performance improvement from this change is significant, and is greatest when filtering returns a small number of rows (the usual case) and the schema has many columns.
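      The two-phase assembly described above can be sketched with simple in-memory columns. All names below are illustrative stand-ins, not the parquet-mr API: a real implementation would operate on column chunks and skip remaining-column decoding per row group or per record.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch of a two-phase record reader: assemble only the
// filter column first, evaluate the predicate, and skip assembly of the
// remaining columns for rows that fail the filter.
public class FilteringRecordReaderSketch {
    // In-memory stand-in for column chunks: column name -> values per row.
    static final Map<String, int[]> COLUMNS = Map.of(
        "a", new int[]{1, 5, 2, 9},
        "b", new int[]{10, 20, 30, 40},
        "c", new int[]{100, 200, 300, 400});

    static List<Map<String, Integer>> read(String filterColumn,
                                           Predicate<Integer> predicate,
                                           List<String> otherColumns) {
        List<Map<String, Integer>> out = new ArrayList<>();
        int rowCount = COLUMNS.get(filterColumn).length;
        for (int row = 0; row < rowCount; row++) {
            // Phase 1: assemble only the predicate column.
            int v = COLUMNS.get(filterColumn)[row];
            if (!predicate.test(v)) {
                continue; // Phase 2 skipped: remaining columns never assembled.
            }
            // Phase 2: assemble the remaining columns for surviving rows only.
            Map<String, Integer> record = new LinkedHashMap<>();
            record.put(filterColumn, v);
            for (String col : otherColumns) {
                record.put(col, COLUMNS.get(col)[row]);
            }
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        // Keep rows where a > 2; columns b and c are assembled only for those rows.
        List<Map<String, Integer>> rows =
            read("a", v -> v > 2, List.of("b", "c"));
        System.out.println(rows);
    }
}
```

      The win comes from the `continue` in phase 1: for discarded rows, the cost of decoding and assembling the non-filter columns is avoided entirely, which is why the gain grows with the number of columns and the selectivity of the filter.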


          People

            Assignee: Unassigned
            Reporter: Yash Datta (saucam)
            Votes: 1
            Watchers: 4
