SPARK-32268: Bloom Filter Join

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL

    Description

      Some joins can be sped up by pre-filtering one side of the join with a Bloom filter and an IN predicate built from the join-key values on the other side, so that rows that cannot match are dropped before the join runs.
      For example, tpcds/q16.sql: the attachments q16-default.jpg and q16-bloom-filter.jpg show the query before and after this optimization.
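
      The pre-filtering idea can be illustrated outside Spark. The following is only a minimal sketch in plain Python, not Spark's implementation: the `BloomFilter` class, the `bloom_filter_join` helper, the filter size, and the hash scheme are all assumptions made for illustration. It builds a Bloom filter from the join keys of the small side, uses it to discard rows of the large side that cannot match, and then runs an exact hash join on the survivors.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter: k hash probes into an m-slot array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, value):
        # Derive k positions by salting a SHA-256 digest with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


def bloom_filter_join(small_side, large_side):
    """Join (key, payload) rows, pre-filtering large_side with a Bloom filter."""
    bloom = BloomFilter()
    lookup = {}
    for key, payload in small_side:
        bloom.add(key)
        lookup.setdefault(key, []).append(payload)

    # Pre-filter: drop rows whose key cannot match before the real join.
    survivors = [row for row in large_side if bloom.might_contain(row[0])]

    # The exact hash join on the survivors removes any false positives.
    joined = [(key, left, right)
              for key, right in survivors
              for left in lookup.get(key, [])]
    return joined, len(survivors)


small = [(1, "a"), (2, "b")]
large = [(k, f"row{k}") for k in range(1000)]
result, scanned = bloom_filter_join(small, large)
print(result, scanned)
```

      Because a Bloom filter can report false positives but never false negatives, the pre-filter may let a few non-matching rows through, yet it never drops a row that would have joined. This is why the exact join is still needed afterwards.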

      Query performance benchmarks: TPC-DS performance evaluation
      Setup: TPC-DS at the 5 TB scale, stored as partitioned Parquet tables.

       

      Query        Default (seconds)    Bloom Filter Join enabled (seconds)
      tpcds q16    84                   46
      tpcds q36    29                   21
      tpcds q57    39                   28
      tpcds q94    42                   34
      tpcds q95    306                  288
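
      To make the size of each improvement easier to compare, the relative runtime reduction can be computed from the numbers in the table above. This is a small sketch: the timings are copied from the table, and the `reduction_pct` helper is a name introduced here for illustration.

```python
# (query, default seconds, Bloom filter join seconds), copied from the table above
timings = [
    ("q16", 84, 46),
    ("q36", 29, 21),
    ("q57", 39, 28),
    ("q94", 42, 34),
    ("q95", 306, 288),
]


def reduction_pct(before, after):
    """Percentage of runtime saved relative to the default run."""
    return 100.0 * (before - after) / before


for query, before, after in timings:
    print(f"tpcds {query}: {reduction_pct(before, after):.1f}% less time")

total_before = sum(before for _, before, _ in timings)
total_after = sum(after for _, _, after in timings)
print(f"all five queries: {total_before}s -> {total_after}s, "
      f"{reduction_pct(total_before, total_after):.1f}% less time")
```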

      Attachments

        1. q16-default.jpg
          2.15 MB
          Yuming Wang
        2. q16-bloom-filter.jpg
          2.35 MB
          Yuming Wang

          People

            Assignee: Yingyi Bu (buyingyi)
            Reporter: Yuming Wang (yumwang)
            Votes: 3
            Watchers: 24
