  1. Apache Drill
  2. DRILL-5795

Filter pushdown for parquet handles multi rowgroup file

Details

    Description

      DRILL-1950 implemented filter pushdown for Parquet files, but only for files with a single rowgroup. When a file contains multiple rowgroups, Drill detects that a rowgroup can be pruned but then tells the drillbit to read the whole file, which leads to performance issues.

      Having multiple rowgroups per file makes it possible to handle a partitioned dataset and still read only the relevant subset of the data, without ending up with more files than really needed.

      For instance, suppose you have a Parquet file composed of two rowgroups, RG1 and RG2, with a single column a. The min/max of a is 1/2 in RG1 and 2/3 in RG2.
      With the query "select a from file where a=3", Drill currently reads the whole file; with the patch it reads only RG2.
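
      The pruning decision in this example can be sketched as follows. This is only an illustration of min/max pruning for an equality predicate, not Drill's actual implementation; the names RowGroupStats and row_groups_to_read are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical holder for one rowgroup's min/max statistics on column a."""
    name: str
    min_value: int
    max_value: int


def row_groups_to_read(row_groups, value):
    """Return the names of the rowgroups that may contain rows where a == value.

    A rowgroup can be skipped when value falls outside its [min, max] range.
    """
    return [
        rg.name
        for rg in row_groups
        if rg.min_value <= value <= rg.max_value
    ]


# The file from the example: RG1 covers 1..2 and RG2 covers 2..3.
file_row_groups = [
    RowGroupStats("RG1", 1, 2),
    RowGroupStats("RG2", 2, 3),
]

# select a from file where a=3
print(row_groups_to_read(file_row_groups, 3))
```

      For a=3 only RG2 is kept. For a=2 the ranges overlap, so both rowgroups must still be read.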

      For documentation
      The Support / Other section in https://drill.apache.org/docs/parquet-filter-pushdown/ should be updated.
      After the fix, files with multiple row groups will be supported.

      Attachments

        1. multirowgroup_overlap.parquet
          0.4 kB
          Damien Profeta

            People

              dprofeta Damien Profeta
              dprofeta Damien Profeta
              Parth Chandra Parth Chandra