[SPARK-47094] SPJ : Dynamically rebalance number of buckets when they are not equal - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0, 3.4.0
Fix Version/s: 4.0.0
Component/s: Spark Core
Labels:
- pull-request-available

Language:
- Java
- Python
- Scala

Description

SPJ: Storage Partition Join works with Iceberg tables when both the tables have same number of buckets. As part of this feature request, we would like spark to gather the number of buckets information from both the tables and dynamically rebalance the number of buckets by coalesce or repartition so that SPJ will work fine. In this case, we would still have to shuffle but would be better than no SPJ.

Use Case :

Many times we do not have control of the input tables, hence it's not possible to change partitioning scheme on those tables. As a consumer, we would still like them to be used with SPJ when used with other tables and output tables which has different number of buckets.

In these scenario, we would need to read those tables rewrite them with matching number of buckets for the SPJ to work, this extra step could outweigh the benefits of less shuffle via SPJ. Also when there are multiple different tables being joined, each tables need to be rewritten with matching number of buckets.

If this feature is implemented, SPJ functionality will be more powerful.

Attachments

Issue Links

links to

GitHub Pull Request #45267

GitHub Pull Request #47126

Activity

People

Assignee:: Szehon Ho

Reporter:: Himadri Pal

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 19/Feb/24 22:56

Updated:: 30/Oct/24 17:11

Resolved:: 06/Apr/24 03:12