  1. Spark
  2. SPARK-37375 Umbrella: Storage Partitioned Join (SPJ)
  3. SPARK-47094

SPJ: Dynamically rebalance the number of buckets when they are not equal


Details

    Description

      Storage Partitioned Join (SPJ) works with Iceberg tables when both tables have the same number of buckets. As part of this feature request, we would like Spark to gather the bucket counts of both tables and dynamically rebalance them, by coalescing or repartitioning, so that SPJ can still be applied. In this case we would still have to shuffle, but that would be better than not using SPJ at all.
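To make the coalescing idea concrete, here is a minimal sketch in plain Python, not Spark's implementation. It assumes a bucket id is a non-negative hash of the key modulo the bucket count, and it uses `zlib.crc32` as a stand-in hash; the names `bucket_of`, `common_bucket_count`, and `coalesce_bucket` are hypothetical. Under that assumption, two layouts with different bucket counts can both be folded down to their greatest common divisor, after which equal keys land in the same bucket on both sides.

```python
import math
import zlib


def bucket_of(key: str, num_buckets: int) -> int:
    """Hypothetical bucket function: non-negative hash modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def common_bucket_count(left_buckets: int, right_buckets: int) -> int:
    """One possible target: the largest count that divides both sides."""
    return math.gcd(left_buckets, right_buckets)


def coalesce_bucket(bucket_id: int, target_buckets: int) -> int:
    """Fold a bucket id from a finer layout onto the coarser target layout.

    Valid only when target_buckets divides the original bucket count,
    because (h % n) % g == h % g whenever g divides n.
    """
    return bucket_id % target_buckets


# Assumed example: one table bucketed into 8, the other into 12.
target = common_bucket_count(8, 12)
keys = [f"user-{i}" for i in range(1000)]
aligned = all(
    coalesce_bucket(bucket_of(k, 8), target)
    == coalesce_bucket(bucket_of(k, 12), target)
    for k in keys
)
```

When the two counts share no useful common divisor (for example 7 and 12, where the divisor is 1), coalescing collapses everything into a single bucket, and repartitioning one side is the remaining option.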

      Use case:

      Often we do not control the input tables, so it is not possible to change the partitioning scheme on those tables. As a consumer, we would still like them to work with SPJ when they are used with other tables and output tables that have a different number of buckets.

      In this scenario, we would need to read those tables and rewrite them with a matching number of buckets for SPJ to work. This extra step could outweigh the benefit of the reduced shuffle that SPJ provides. Also, when several different tables are joined, each table needs to be rewritten with a matching number of buckets.

      If this feature is implemented, SPJ functionality will be more powerful.

            People

              szehon Szehon Ho
              hpal Himadri Pal