Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8165

Planner does not push through predicates when there is a disjunction

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      If we take a simple query like such:

      select avg(ss_quantity)
      from store_sales 
      join household_demographics on (ss_hdemo_sk=hd_demo_sk)
      where (ss_sales_price between 0 and 100 and hd_dep_count = 1)
       or (ss_sales_price between 100 and 200 and hd_dep_count = 2);
      

      and look at the plan we see that neither table scan has any predicates pushed to it, the only filter is in the join

      (from impalad version 2.12.0-cdh5.16.x RELEASE (build 3f68649c7bf8a01fb6ba0cbe35dd2492adb836dd)

      PLAN-ROOT SINK
      |
      06:AGGREGATE [FINALIZE]
      | output: avg:merge(ss_quantity)
      |
      05:EXCHANGE [UNPARTITIONED]
      |
      03:AGGREGATE
      | output: avg(ss_quantity)
      |
      02:HASH JOIN [INNER JOIN, BROADCAST]
      | hash predicates: ss_hdemo_sk = hd_demo_sk
      | other predicates: (ss_sales_price >= 0 AND ss_sales_price <= 100 AND hd_dep_count = 1) OR 
      |                   (ss_sales_price >= 100 AND ss_sales_price <= 200 AND hd_dep_count = 2)
      | runtime filters: RF000 <- hd_demo_sk
      |
      |--04:EXCHANGE [BROADCAST]
      | |
      | 01:SCAN HDFS [tpcds_1000_parquet.household_demographics]
      | partitions=1/1 files=1 size=41.08KB
      |
      00:SCAN HDFS [tpcds_1000_parquet.store_sales]
       partitions=1824/1824 files=1824 size=189.24GB
       runtime filters: RF000 -> ss_hdemo_sk
      

      If we look at PostgreSQL 11.1, we'll see that not only does the join filter, but the table scan has the appropriate filters pushed to it.

       Finalize Aggregate  (cost=67549.69..67549.70 rows=1 width=32)
         ->  Gather  (cost=67549.47..67549.68 rows=2 width=32)
               Workers Planned: 2
               ->  Partial Aggregate  (cost=66549.47..66549.48 rows=1 width=32)
                     ->  Hash Join  (cost=113.12..66547.64 rows=734 width=4)
                           Hash Cond: (store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk)
                           Join Filter: (((store_sales.ss_sales_price >= '0'::numeric) AND 
                                          (store_sales.ss_sales_price <= '100'::numeric) AND 
                                          (household_demographics.hd_dep_count = 1)) OR 
                                         ((store_sales.ss_sales_price >= '100'::numeric) AND 
                                          (store_sales.ss_sales_price <= '200'::numeric) AND 
                                          (household_demographics.hd_dep_count = 2)))
                           ->  Parallel Seq Scan on store_sales  (cost=0.00..66343.20 rows=7305 width=22)
                                 Filter: (((ss_sales_price >= '0'::numeric) AND (ss_sales_price <= '100'::numeric)) OR 
                                          ((ss_sales_price >= '100'::numeric) AND (ss_sales_price <= '200'::numeric)))
                           ->  Hash  (cost=112.62..112.62 rows=40 width=8)
                                 ->  Seq Scan on household_demographics  (cost=0.00..112.62 rows=40 width=8)
                                       Filter: ((hd_dep_count = 1) OR (hd_dep_count = 2))
      

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            amansinha Aman Sinha
            grahn Greg Rahn
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment