  1. Apache Drill
  2. DRILL-6727

JPPD does not eliminate rows using the bloom filter if a HashJoin is involved


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • 1.15.0
    • None
    • Execution - Flow
    • None
    • Important

    Description

      When testing a simple join between two tables, the Bloom-filter based join predicate pushdown (JPPD) appears to work only for broadcast joins, not for hash-based joins.

      The purpose of the filter is to reduce the number of records being hashed across the fragments. Because no rows are eliminated in the hash-join case, the runtime does not improve.
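
      For readers unfamiliar with the technique, the intended effect can be sketched as follows. This is a minimal, illustrative model and not Drill's implementation: the `BloomFilter` class, its sizing defaults, and `probe_rows_after_filter` are all hypothetical names chosen for this example. The filter is built from the build-side join keys and applied to probe-side rows before they would be hashed and exchanged.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (hypothetical; not Drill's BloomFilter class)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{seed}:{key}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def probe_rows_after_filter(build_keys, probe_keys):
    """Keep only probe-side keys that may match a build-side key."""
    bloom = BloomFilter()
    for key in build_keys:
        bloom.add(key)
    return [key for key in probe_keys if bloom.might_contain(key)]


# Small build side, large probe side: most probe rows are dropped
# before they would reach the hash exchange.
survivors = probe_rows_after_filter(range(100), range(10_000))
```

      A bloom filter never produces false negatives, so every matching probe row survives; a small number of non-matching rows may also pass and are rejected later by the join itself.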

      Join Query (TPCH dataset):

      select
      l.l_orderkey
      , sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
      , o.o_orderdate
      , o.o_shippriority
      from
      orders o
      , lineitem l
      where
      l.l_orderkey = o.o_orderkey
      and o.o_orderdate = date '1994-08-26'
      and MOD(o.o_custkey,10) = 1
      group by
      l.l_orderkey
      , o.o_orderdate
      , o.o_shippriority
      order by
      revenue desc
      , o.o_orderdate limit 10;
      
      

      This produces about 6K rows on the build side, and about 10M rows from the probe side are expected to join.

      The results for this query under each join mode are:

      ||Join Mode||Profile||Runtime||Status||
      |BCastJoin w/o JPPD|bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json|3.148 sec|As expected. 600M rows are scanned and probed against the locally available hash table.|
      |BCastJoin w/ JPPD|bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json|3.570 sec|04-xx-06 shows a reduction in rows. 600M rows are scanned, but only 10M rows are probed against the locally available hash table.|
      |HashJoin w/o JPPD|hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json|5.861 sec|As expected. 600M rows are scanned and probed against the hash table.|
      |HashJoin w/ JPPD|hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json|8.376 sec|03-xx-07 is not seeing a reduction in rows. All 600M rows are scanned and probed against the hash table.|
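
      To make the reported numbers easier to compare, the short calculation below derives the relative runtime with JPPD enabled for each join mode, plus the fraction of probe rows expected to pass the filter. The runtimes and row counts are taken from the results above; the variable and function names are illustrative only.

```python
# Runtimes in seconds, keyed by (join mode, JPPD enabled), from the results above.
runtimes_sec = {
    ("BCastJoin", False): 3.148,
    ("BCastJoin", True): 3.570,
    ("HashJoin", False): 5.861,
    ("HashJoin", True): 8.376,
}


def jppd_runtime_ratio(join_mode):
    """Runtime with JPPD divided by runtime without it (>1 means slower)."""
    return runtimes_sec[(join_mode, True)] / runtimes_sec[(join_mode, False)]


# Fraction of scanned probe rows expected to survive the filter:
# about 10M of the 600M scanned rows.
expected_pass_fraction = 10_000_000 / 600_000_000

for mode in ("BCastJoin", "HashJoin"):
    print(f"{mode}: JPPD/default runtime ratio = {jppd_runtime_ratio(mode):.3f}")
print(f"expected pass fraction = {expected_pass_fraction:.4f}")
```

      In both modes the JPPD run is slower than the default run, and the gap is larger for the hash join, where the filter adds cost without removing any rows.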

      There are a few possible reasons why the RuntimeFilter does not eliminate any rows when a HashJoin is involved:
      1. The RuntimeFilter operator does not have a bloom filter.
      2. The RuntimeFilter receives the bloom filter after the scan completes, because the foreman has not finished building and distributing the global bloom filter.
      3. The RuntimeFilter receives the bloom filter during the scan, but does not apply it.
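
      The three hypotheses can be told apart by their effect on the number of rows passed downstream. The toy simulation below is a sketch under stated assumptions, not Drill's RuntimeFilter operator: `scan_with_runtime_filter` and its parameters are hypothetical, and an exact Python set stands in for the bloom filter.

```python
def scan_with_runtime_filter(batches, build_keys, filter_arrives_at, apply_filter=True):
    """Count rows passed downstream by a toy runtime-filter operator.

    batches           -- list of lists of probe-side join keys
    build_keys        -- set of build-side keys (stand-in for the bloom filter)
    filter_arrives_at -- index of the first batch that sees the filter,
                         or None if the filter never arrives (hypothesis 1)
    apply_filter      -- False models a filter that is received but ignored
                         (hypothesis 3)
    """
    passed = 0
    for index, batch in enumerate(batches):
        has_filter = filter_arrives_at is not None and index >= filter_arrives_at
        if has_filter and apply_filter:
            passed += sum(1 for key in batch if key in build_keys)
        else:
            # No usable filter for this batch: every row flows downstream.
            passed += len(batch)
    return passed


# Ten batches of 100 keys each; three keys exist on the build side.
batches = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]
build_keys = {5, 105, 905}

on_time = scan_with_runtime_filter(batches, build_keys, 0)
never = scan_with_runtime_filter(batches, build_keys, None)
after_scan = scan_with_runtime_filter(batches, build_keys, len(batches))
ignored = scan_with_runtime_filter(batches, build_keys, 0, apply_filter=False)
mid_scan = scan_with_runtime_filter(batches, build_keys, 5)
```

      In this model, hypotheses 1, 2, and 3 all yield the full row count, which matches the HashJoin w/ JPPD profile; a filter that arrives partway through the scan would instead show a partial reduction. Row counts alone therefore cannot distinguish the three cases, so the filter's arrival time would need to be checked separately.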

          People

            weijie Weijie Tong
            kkhatua Kunal Khatua