Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-6865

Query returns wrong result when filter pruning happens

    XMLWordPrintableJSON

Details

    Description

      In DRILL-5796 was implemented removing the filter from the plan when some (or all) row groups of parquet table fully match the filter.

      For the case when filter has some predicates which parquet filter predicate does not support, they can be omitted for some cases from the resulting filter predicate. When row groups fully match predicates which left in the filter, the whole filter is removed from the plan and the wrong result is returned.

      Example of the query for reproducing this bug:

      create table dfs.tmp.`multi/t1` as select * from cp.`tpch/nation.parquet` where n_nationkey > 5;
      create table dfs.tmp.`multi/t2` as select * from cp.`tpch/nation.parquet` where n_nationkey < 5;
      select * from dfs.tmp.`multi` where n_nationkey > 5 and n_nationkey like '%10%';
      

      returns

      +-------+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
      | dir0  | n_nationkey  |     n_name      | n_regionkey  |                                                      n_comment                                                      |
      +-------+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
      | t1    | 6            | FRANCE          | 3            | refully final requests. regular, ironi                                                                              |
      | t1    | 7            | GERMANY         | 3            | l platelets. regular accounts x-ray: unusual, regular acco                                                          |
      | t1    | 8            | INDIA           | 2            | ss excuses cajole slyly across the packages. deposits print aroun                                                   |
      | t1    | 9            | INDONESIA       | 2            |  slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull  |
      | t1    | 10           | IRAN            | 4            | efully alongside of the slyly final dependencies.                                                                   |
      | t1    | 11           | IRAQ            | 4            | nic deposits boost atop the quickly final requests? quickly regula                                                  |
      | t1    | 12           | JAPAN           | 2            | ously. final, express gifts cajole a                                                                                |
      | t1    | 13           | JORDAN          | 4            | ic deposits are blithely about the carefully regular pa                                                             |
      | t1    | 14           | KENYA           | 0            |  pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t                       |
      | t1    | 15           | MOROCCO         | 0            | rns. blithely bold courts among the closely regular packages use furiously bold platelets?                          |
      | t1    | 16           | MOZAMBIQUE      | 0            | s. ironic, unusual asymptotes wake blithely r                                                                       |
      | t1    | 17           | PERU            | 1            | platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun          |
      | t1    | 18           | CHINA           | 2            | c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos                         |
      | t1    | 19           | ROMANIA         | 3            | ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account     |
      | t1    | 20           | SAUDI ARABIA    | 4            | ts. silent requests haggle. closely express packages sleep across the blithely                                      |
      | t1    | 21           | VIETNAM         | 2            | hely enticingly express accounts. even, final                                                                       |
      | t1    | 22           | RUSSIA          | 3            |  requests against the platelets use never according to the quickly regular pint                                     |
      | t1    | 23           | UNITED KINGDOM  | 3            | eans boost carefully special requests. accounts are. carefull                                                       |
      | t1    | 24           | UNITED STATES   | 1            | y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be      |
      +-------+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
      

      but single row should be returned:

      +-------+--------------+---------+--------------+-----------------------------------------------------+
      | dir0  | n_nationkey  | n_name  | n_regionkey  |                      n_comment                      |
      +-------+--------------+---------+--------------+-----------------------------------------------------+
      | t1    | 10           | IRAN    | 4            | efully alongside of the slyly final dependencies.   |
      +-------+--------------+---------+--------------+-----------------------------------------------------+
      

      Filter is removed from the plan, but it contains a predicate which wasn't applied:

      00-00    Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 19.0, cumulative cost = {77.9 rows, 115.9 cpu, 38.0 io, 0.0 network, 0.0 memory}, id = 400
      00-01      Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 19.0, cumulative cost = {76.0 rows, 114.0 cpu, 38.0 io, 0.0 network, 0.0 memory}, id = 399
      00-02        Project(T1¦¦**=[$0]) : rowType = RecordType(DYNAMIC_STAR T1¦¦**): rowcount = 19.0, cumulative cost = {57.0 rows, 95.0 cpu, 38.0 io, 0.0 network, 0.0 memory}, id = 398
      00-03          Project(T1¦¦**=[$0], n_nationkey=[$1]) : rowType = RecordType(DYNAMIC_STAR T1¦¦**, ANY n_nationkey): rowcount = 19.0, cumulative cost = {38.0 rows, 76.0 cpu, 38.0 io, 0.0 network, 0.0 memory}, id = 397
      00-04            Scan(table=[[dfs, tmp, multi]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/multi/t1/0_0_0.parquet]], selectionRoot=file:/tmp/multi, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`, `n_nationkey`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY n_nationkey): rowcount = 19.0, cumulative cost = {19.0 rows, 38.0 cpu, 38.0 io, 0.0 network, 0.0 memory}, id = 396
      

      Additionally, a filter is not removed from the plan when parquet table with single row group is queried:

      create table dfs.tmp.`singleRowGroupTable` as select * from cp.`tpch/nation.parquet`;
      explain plan for select * from dfs.tmp.`singleRowGroupTable` where n_nationkey > -1;
      

      returns plan

      00-00    Screen
      00-01      Project(**=[$0])
      00-02        Project(T0¦¦**=[$0])
      00-03          SelectionVectorRemover
      00-04            Filter(condition=[>($1, -1)])
      00-05              Project(T0¦¦**=[$0], n_nationkey=[$1])
      00-06                Scan(table=[[dfs, tmp, singleRowGroupTable]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/singleRowGroupTable]], selectionRoot=file:/tmp/singleRowGroupTable, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`, `n_nationkey`]]])
      

      Also, for the case when a table has multiple files, and filter matches all the table, it is not removed from the plan:

      select * from dfs.tmp.`multi` where n_nationkey > -1;
      

      has plan

      00-00    Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 12.0, cumulative cost = {109.2 rows, 277.2 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 196
      00-01      Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 12.0, cumulative cost = {108.0 rows, 276.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 195
      00-02        Project(T0¦¦**=[$0]) : rowType = RecordType(DYNAMIC_STAR T0¦¦**): rowcount = 12.0, cumulative cost = {96.0 rows, 264.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 194
      00-03          SelectionVectorRemover : rowType = RecordType(DYNAMIC_STAR T0¦¦**, ANY n_nationkey): rowcount = 12.0, cumulative cost = {84.0 rows, 252.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 193
      00-04            Filter(condition=[>($1, -1)]) : rowType = RecordType(DYNAMIC_STAR T0¦¦**, ANY n_nationkey): rowcount = 12.0, cumulative cost = {72.0 rows, 240.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 192
      00-05              Project(T0¦¦**=[$0], n_nationkey=[$1]) : rowType = RecordType(DYNAMIC_STAR T0¦¦**, ANY n_nationkey): rowcount = 24.0, cumulative cost = {48.0 rows, 96.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 191
      00-06                Scan(table=[[dfs, tmp, multi]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/multi/t2/0_0_0.parquet], ReadEntryWithPath [path=file:/tmp/multi/t1/0_0_0.parquet]], selectionRoot=file:/tmp/multi, numFiles=2, numRowGroups=2, usedMetadataFile=false, columns=[`**`, `n_nationkey`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY n_nationkey): rowcount = 24.0, cumulative cost = {24.0 rows, 48.0 cpu, 48.0 io, 0.0 network, 0.0 memory}, id = 190
      

      Attachments

        Issue Links

          Activity

            People

              volodymyr Vova Vysotskyi
              volodymyr Vova Vysotskyi
              Arina Ielchiieva Arina Ielchiieva
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: