[SPARK-48030] InternalRowComparableWrapper should cache rowOrdering to improve performance - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.1, 3.4.3
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

InternalRowComparableWrapper recreates the row ordering for each output partition when Storage Partitioned Join (SPJ) is enabled. The row ordering is generated via codegen, which is quite expensive, and the number of output partitions can be very large for production tables, for example hundreds of thousands of partitions. We encountered this issue when applying SPJ to multiple large Iceberg tables: the planning phase took tens of minutes to complete.

The attached screenshot (screenshot-1.png) shows the related stack trace.

A simple fix is to cache the rowOrdering for InternalRowComparableWrapper, since the data types of the InternalRow are immutable.
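The idea behind the proposed fix can be illustrated with a minimal, self-contained sketch. This is not the code from the pull request and does not use any Spark API: `build_row_ordering`, `cached_row_ordering`, and `ComparableRowWrapper` are hypothetical names, the expensive codegen step is simulated by a plain function with a counter, and data types are represented as strings. It only shows that keying a cache on the (immutable) sequence of data types lets every wrapper with the same schema share one ordering.

```python
from functools import lru_cache
from typing import Callable, Sequence, Tuple

# Counts how many times the (simulated) expensive ordering is built.
build_count = 0


def build_row_ordering(data_types: Tuple[str, ...]) -> Callable[[Sequence, Sequence], int]:
    """Hypothetical stand-in for codegen: build a comparator for rows of these types."""
    global build_count
    build_count += 1

    def compare(left: Sequence, right: Sequence) -> int:
        # Field-by-field comparison; returns -1, 0, or 1.
        for l, r in zip(left, right):
            if l != r:
                return -1 if l < r else 1
        return 0

    return compare


@lru_cache(maxsize=None)
def cached_row_ordering(data_types: Tuple[str, ...]) -> Callable[[Sequence, Sequence], int]:
    # The tuple of data types is immutable and hashable, so it is a safe cache key.
    return build_row_ordering(data_types)


class ComparableRowWrapper:
    """Hypothetical wrapper that makes a row usable as a set element or dict key."""

    def __init__(self, row: Sequence, data_types: Sequence[str]):
        self.row = tuple(row)
        self.data_types = tuple(data_types)
        # Looked up in the cache instead of being rebuilt per wrapper.
        self.ordering = cached_row_ordering(self.data_types)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ComparableRowWrapper):
            return NotImplemented
        return (self.data_types == other.data_types
                and self.ordering(self.row, other.row) == 0)

    def __hash__(self) -> int:
        return hash(self.row)


# One wrapper per partition value, all sharing the same schema.
wrappers = [ComparableRowWrapper((i, "p"), ("int", "string")) for i in range(100_000)]
print(build_count)  # prints 1: the ordering was built once, not 100,000 times
```

Without the cache, the constructor would call `build_row_ordering` once per wrapper, which mirrors the per-partition cost described above.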

Attachments

screenshot-1.png
28/Apr/24 12:30
403 kB
YE

Issue Links

is related to

SPARK-37375 Umbrella: Storage Partitioned Join (SPJ)

Resolved

links to

GitHub Pull Request #46265

People

Assignee: YE

Reporter: YE

Votes: 0

Watchers: 2

Dates

Created: 28/Apr/24 12:30

Updated: 30/Apr/24 06:28

Resolved: 30/Apr/24 04:26