Spark / SPARK-32268

Bloom Filter Join


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      We can improve the performance of some joins by pre-filtering one side of the join with a Bloom filter and an IN predicate generated from the values on the other side of the join.
      For example: tpcds/q16.sql, before this optimization and after this optimization.
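
The pre-filtering idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark's implementation: `BloomFilter`, `bloom_filter_join`, and the bit and hash counts are hypothetical names and values chosen for the example. An IN predicate would play the same role with an exact set of keys instead of a probabilistic filter.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical): no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def bloom_filter_join(small_side, large_side, key):
    """Inner join that pre-filters large_side with a Bloom filter built on key."""
    bloom = BloomFilter()
    by_key = {}
    for row in small_side:
        bloom.add(row[key])
        by_key.setdefault(row[key], []).append(row)

    # Pre-filter: rows whose key is definitely absent are dropped early.
    candidates = [row for row in large_side if bloom.might_contain(row[key])]

    # The exact join still runs, so false positives never change the result.
    joined = [
        {**left, **right}
        for right in candidates
        for left in by_key.get(right[key], [])
    ]
    return joined, len(candidates)
```

Calling `bloom_filter_join(dims, facts, "k")` with a small `dims` list and a large `facts` list returns the joined rows together with the number of `facts` rows that survived the pre-filter, which is the quantity this optimization aims to shrink before the join runs.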

      Query Performance Benchmarks: TPC-DS Performance Evaluation
      Our setup for running the TPC-DS benchmark was as follows: TPC-DS 5T with partitioned Parquet tables.

       

      Query        Default (seconds)    Bloom filter join enabled (seconds)
      tpcds q16    84                   46
      tpcds q36    29                   21
      tpcds q57    39                   28
      tpcds q94    42                   34
      tpcds q95    306                  288
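
To read the table as relative gains, the short sketch below computes the speedup and the share of time saved for each query. The timings are copied from the table; the helper names `speedup` and `time_saved_pct` are hypothetical and exist only for this illustration.

```python
# Timings in seconds copied from the table: (default, Bloom filter join enabled).
timings = {
    "q16": (84, 46),
    "q36": (29, 21),
    "q57": (39, 28),
    "q94": (42, 34),
    "q95": (306, 288),
}


def speedup(default_s, bloom_s):
    """How many times faster the query runs with the Bloom filter join."""
    return default_s / bloom_s


def time_saved_pct(default_s, bloom_s):
    """Percentage of the default runtime that the optimization removes."""
    return 100.0 * (default_s - bloom_s) / default_s


for query, (default_s, bloom_s) in timings.items():
    print(
        f"{query}: {speedup(default_s, bloom_s):.2f}x faster, "
        f"{time_saved_pct(default_s, bloom_s):.1f}% less time"
    )
```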


          People

            Assignee: Yingyi Bu (buyingyi)
            Reporter: Yuming Wang (yumwang)
            Votes: 3
            Watchers: 24
