[SPARK-32628] Use bloom filter to improve dynamicPartitionPruning


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      It throws an exception when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is disabled:

      select catalog_sales.* from  catalog_sales join catalog_returns  where cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 40000;
      
      20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
      

      We can improve this by collecting only the minimum, the maximum, and a Bloom filter of the pruning keys, which reduces the size of the serialized results; a sketch of the idea follows.
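
      Below is a minimal, user-level sketch of the idea, not the optimizer change itself. It manually builds the min, max, and a Bloom filter over cr_order_number on the filter side of the query from this description, then prunes catalog_sales (assumed to be partitioned by cs_sold_date_sk) with a range check and the Bloom filter instead of shipping every distinct key through the driver. The expectedNumItems and fpp values, the object name, and the assumption that the key column is a bigint are all illustrative.

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, max, min, udf}

      object BloomFilterDppSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("bloom-filter-dpp-sketch").getOrCreate()

          // Tables and columns come from the query above; catalog_sales is
          // assumed to be partitioned by cs_sold_date_sk.
          val catalogSales   = spark.table("catalog_sales")
          val catalogReturns = spark.table("catalog_returns")

          // Filter side of the join.
          val filterSide = catalogReturns
            .where(col("cr_returned_time_sk") < 40000)
            .select(col("cr_order_number"))

          // Compact summary of the pruning keys: min, max, and a Bloom filter
          // (assumes cr_order_number is a non-empty bigint column; the
          // expectedNumItems and fpp values are arbitrary).
          val range  = filterSide.agg(min("cr_order_number"), max("cr_order_number")).first()
          val minKey = range.getLong(0)
          val maxKey = range.getLong(1)
          val bloom  = filterSide.stat.bloomFilter("cr_order_number", 1000000L, 0.03)
          val bloomBc = spark.sparkContext.broadcast(bloom)

          val mightContain = udf((key: java.lang.Long) =>
            key != null && bloomBc.value.mightContainLong(key))

          // Prune the fact table with the range check and the Bloom filter
          // instead of serializing every distinct key to the driver.
          val prunedSales = catalogSales
            .where(col("cs_sold_date_sk").between(minKey, maxKey))
            .where(mightContain(col("cs_sold_date_sk")))

          prunedSales
            .join(catalogReturns, col("cr_order_number") === col("cs_sold_date_sk"))
            .where(col("cr_returned_time_sk") < 40000)
            .select(catalogSales.columns.map(catalogSales(_)): _*)
            .show()
        }
      }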

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Yuming Wang (yumwang)

