Spark / SPARK-32628

Use bloom filter to improve dynamicPartitionPruning

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The following query fails with an exception when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is disabled:

      select catalog_sales.* from  catalog_sales join catalog_returns  where cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 40000;
      
      20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
      

      We can improve this by using the minimum and maximum values together with a Bloom filter, which reduces the size of the serialized results.
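To illustrate the idea, here is a minimal Python sketch of pruning with a minimum, a maximum, and a Bloom filter. It is not Spark's implementation: the names BloomFilter, build_pruning_filter and keep_partition are hypothetical, and the sketch assumes integer join keys. The point is that the filter has a fixed, small size no matter how many distinct keys the filtering side produces, at the cost of some false positives (a few partitions are kept that did not need to be).

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter over join keys (illustrative, not Spark's class)."""

    def __init__(self, expected_items, false_positive_rate=0.03):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(8, int(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two 64-bit hash values.
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build_pruning_filter(keys):
    """Summarize the filtering side's join keys as (min, max, Bloom filter)."""
    keys = list(keys)
    bloom = BloomFilter(len(keys))
    for key in keys:
        bloom.add(key)
    if not keys:
        return None, None, bloom
    return min(keys), max(keys), bloom


def keep_partition(partition_key, pruning_filter):
    """Return False only when the partition cannot contain a matching key."""
    lo, hi, bloom = pruning_filter
    if lo is None:
        return False  # empty filtering side: nothing can match
    # Cheap range check first, then the Bloom filter membership test.
    return lo <= partition_key <= hi and bloom.might_contain(partition_key)
```

A caller would build the filter once from the filtering side's keys and then evaluate keep_partition for each partition value of the other table. Because a Bloom filter has no false negatives, no matching partition is ever dropped.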

            People

              Assignee: Unassigned
              Reporter: Yuming Wang (yumwang)
              Votes: 0
              Watchers: 9
