Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
3.3.0, 3.4.0
Description
SPJ (Storage Partitioned Join) works with Iceberg tables only when both tables have the same number of buckets. As part of this feature request, we would like Spark to gather the bucket counts of both tables and dynamically reconcile them, by coalescing or repartitioning, so that SPJ can still apply. In this case we would still have to shuffle, but that would be better than not using SPJ at all.
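To illustrate the coalescing idea, here is a minimal sketch in plain Python. It is not Spark's or Iceberg's implementation. It assumes that a row's bucket ID is a non-negative hash of the join key modulo the table's bucket count (an assumption about the bucketing function, not something stated in this ticket). Under that assumption, both sides can be coalesced to the greatest common divisor of the two bucket counts by taking each existing bucket ID modulo that value. The function names `common_bucket_count`, `coalesce_bucket`, and `plan_join_groups` are hypothetical.

```python
from math import gcd


def common_bucket_count(left_buckets: int, right_buckets: int) -> int:
    """Largest bucket count that both sides can be coalesced to."""
    if left_buckets <= 0 or right_buckets <= 0:
        raise ValueError("bucket counts must be positive")
    return gcd(left_buckets, right_buckets)


def coalesce_bucket(bucket_id: int, target_buckets: int) -> int:
    """Map an existing bucket ID onto the coalesced bucket layout.

    Assumes bucket_id == hash(key) % n with a non-negative hash and
    target_buckets dividing n, so (hash % n) % target == hash % target.
    """
    return bucket_id % target_buckets


def plan_join_groups(left_buckets: int, right_buckets: int) -> dict:
    """Group the original buckets of each side by coalesced bucket ID.

    Returns {coalesced_id: (left_bucket_ids, right_bucket_ids)}.
    """
    target = common_bucket_count(left_buckets, right_buckets)
    groups = {t: ([], []) for t in range(target)}
    for b in range(left_buckets):
        groups[coalesce_bucket(b, target)][0].append(b)
    for b in range(right_buckets):
        groups[coalesce_bucket(b, target)][1].append(b)
    return groups


if __name__ == "__main__":
    # A 4-bucket table joined with an 8-bucket table coalesces to 4 groups.
    for group_id, (left, right) in plan_join_groups(4, 8).items():
        print(group_id, left, right)
```

When the two bucket counts share no useful common divisor (for example 5 and 8, where the GCD is 1), this approach degenerates to a single group, and repartitioning one side to match the other is likely the better choice.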
Use Case:
Often we do not control the input tables, so it is not possible to change the partitioning scheme on those tables. As a consumer, we would still like to use them with SPJ when they are joined with other tables, or written to output tables, that have a different number of buckets.
In these scenarios, we would need to read those tables and rewrite them with a matching number of buckets for SPJ to work; this extra step could outweigh the benefit of reduced shuffling via SPJ. Also, when multiple different tables are joined, each table needs to be rewritten with a matching number of buckets.
If this feature is implemented, SPJ functionality will be more powerful.