IMPALA-11707

Wrong results when global runtime IN-list filters are applied

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 4.1.0, Impala 4.1.1
    • Fix Version/s: Impala 4.2.0
    • Component/s: Backend

    Description

      Found this bug when running a large-scale TPC-H benchmark. The bug can be reproduced with the following query:

      use tpch_orc_def;
      set enabled_runtime_filter_types=in_list;
      select count(*) from supplier, nation, region
      where s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE';

      The result is 0, which is wrong; the expected result is 1987. The summary shows that the scan node on the "nation" table returns 0 rows:

      04:HASH JOIN                  1      1  445.629us  445.629us      0       2.00K    1.98 MB        1.94 MB  INNER JOIN, BROADCAST 
      |--07:EXCHANGE                1      1   40.466us   40.466us      1           1   16.00 KB       16.00 KB  BROADCAST             
      |  F02:EXCHANGE SENDER        1      1  217.341us  217.341us                       8.60 KB       99.20 KB                        
      |  02:SCAN HDFS               1      1    4.507ms    4.507ms      1           1  917.09 KB       96.00 MB  tpch_orc_def.region   
      03:HASH JOIN                  1      1    2.112ms    2.112ms      0      10.00K    1.97 MB        1.94 MB  INNER JOIN, BROADCAST 
      |--06:EXCHANGE                1      1   27.803us   27.803us      0          25          0       16.00 KB  BROADCAST             
      |  F01:EXCHANGE SENDER        1      1   89.872us   89.872us                      25.59 KB       32.00 KB                        
      |  01:SCAN HDFS               1      1   12.833ms   12.833ms      0          25   32.00 KB       64.00 MB  tpch_orc_def.nation   
      00:SCAN HDFS                  1      1  371.636us  371.636us      0      10.00K   16.00 KB       32.00 MB  tpch_orc_def.supplier 

      There is a runtime IN-list filter applied on this node:

      01:SCAN HDFS [tpch_orc_def.nation, RANDOM]
         HDFS partitions=1/1 files=1 size=1.74KB
         runtime filters: RF000[in_list] -> n_regionkey
         stored statistics:
           table: rows=25 size=1.74KB
           columns: all 
         extrapolated-rows=disabled max-scan-range-rows=25
         mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1
         tuple-ids=1 row-size=4B cardinality=25
         in pipelines: 01(GETNEXT)

      The filter is generated from the build side that reads the "region" table with the predicate "r_name = 'EUROPE'". Note that it is a global runtime filter generated by other impalads (not the impalad scanning the "nation" table).
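
      To make the expected contents of the filter concrete, the build side can be evaluated by hand. The sketch below assumes the same tpch_orc_def database as the reproduction query; it lists the r_regionkey values that the IN-list filter should carry and then applies them to "nation" as an ordinary IN predicate, which emulates what a correct filter would let through. It does not show how Impala builds or ships the filter internally.

      ```sql
      use tpch_orc_def;

      -- Values the build side should contribute to RF000:
      -- the distinct join keys of "region" that survive the predicate.
      select distinct r_regionkey
      from region
      where r_name = 'EUROPE';

      -- Emulate the filter on the probe side with a static IN predicate.
      -- Any row counted here is a row that RF000 must not reject.
      select count(*)
      from nation
      where n_regionkey in (
        select r_regionkey from region where r_name = 'EUROPE'
      );
      ```

      If the second query returns a non-zero count, rejecting the only file of "nation" cannot be correct.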

      The profile shows that this filter rejects one file, which is the only file of the "nation" table.

              Filter 0 (2.00 KB):
                 - Files processed: 1 (1)
                 - Files rejected: 1 (1)
                 - Files total: 1 (1)
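
      The plan fragment, summary, and filter counters quoted above can be collected from impala-shell roughly as follows. This is a sketch: explain_level=2 is assumed to be the level that produced the plan shown here, and summary and profile are impala-shell commands that report on the most recently executed query.

      ```sql
      use tpch_orc_def;
      set enabled_runtime_filter_types=in_list;

      -- Extended plan output: includes the "runtime filters:" line of each scan node.
      set explain_level=2;
      explain select count(*) from supplier, nation, region
      where s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE';

      -- Run the query, then print its per-operator summary and full profile.
      select count(*) from supplier, nation, region
      where s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE';
      summary;
      profile;
      ```

      In the profile, the "Files rejected" counter under the filter entry of the "nation" scan is the value to check.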

      This is wrong since at least 5 rows in the file should pass the filter:

      impala-shell> select count(*) from nation, region where n_regionkey = r_regionkey and r_name = 'EUROPE';
      +----------+
      | count(*) |
      +----------+
      | 5        |
      +----------+
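
      As a cross-check, the reproduction query can be rerun without runtime filters, so that no IN-list filter reaches the scan of "nation". This is a minimal sketch using the existing RUNTIME_FILTER_MODE query option; whether disabling runtime filters is an acceptable workaround on a given cluster is an assumption, since it also gives up their performance benefit.

      ```sql
      use tpch_orc_def;

      -- Turn runtime filtering off for this session only.
      set runtime_filter_mode=off;

      select count(*) from supplier, nation, region
      where s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE';
      ```

      With filtering disabled, the count should match the expected result stated above rather than 0.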

            People

              Assignee: stigahuang Quanlong Huang
              Reporter: stigahuang Quanlong Huang