[SPARK-29881] Introduce API for manually breaking up dataset plan - ASF JIRA

Details

Type: Wish
Status: Resolved
Priority: Trivial
Resolution: Incomplete
Affects Version/s: 2.4.4
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

I'm calling relatively expensive functions from Spark SQL and then using each result several times in a loop via transform.

Although WholeStageCodegen is usually helpful, it evaluates expressions at each point of use. Consider, for example:

SELECT transform(sequence(0, 32), x -> expensive_result * x)
FROM (
  SELECT expensive_operation(foo) AS expensive_result FROM source
)

Here the expensive_operation function will almost certainly be called once per array element for each source row (33 times, since sequence(0, 32) includes both endpoints), with no explicit way to cache the value in between.

The workaround I've found for now is to insert something like {{.filter { _ => true }}} between the two steps. This creates a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations such as PushDown. It does produce the intended result: expensive_operation runs only once per source row.
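
A minimal sketch of this workaround in the DataFrame API might look like the following. The object name, the `withBarrier` helper, the sample data, and the body of the `expensive_operation` UDF are all hypothetical stand-ins for illustration; only the no-op typed filter comes from the description above.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object BarrierWorkaround {
  // Builds the query from the description, with a no-op typed filter
  // inserted between the expensive projection and the transform.
  // Assumes a UDF named expensive_operation is already registered.
  def withBarrier(source: DataFrame): DataFrame =
    source
      .selectExpr("expensive_operation(foo) AS expensive_result")
      .filter((_: Row) => true) // the workaround: always-true typed filter
      .selectExpr("transform(sequence(0, 32), x -> expensive_result * x) AS scaled")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("barrier-workaround")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the expensive function.
    spark.udf.register("expensive_operation", (foo: Long) => foo * 2L)

    val source = Seq(1L, 2L, 3L).toDF("foo")

    // Print the physical plan; compare it with the plan produced
    // when the filter line is removed.
    withBarrier(source).explain()

    spark.stop()
  }
}
```

Comparing the `explain()` output with and without the filter line is one way to check whether the two projections stay separate on a given Spark version.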

It would be great to have an API on Dataset, such as .barrier(), that introduces an explicit barrier to whole-stage codegen without adding any other behavior or getting in the way of PushDown optimizations.
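
To illustrate the call shape being requested, here is a rough sketch that wraps the filter workaround in an extension method. Spark's Dataset has no `barrier()` method; `BarrierSyntax` and `DatasetBarrierOps` are hypothetical names, and because this version is just the no-op typed filter, it still carries the PushDown drawback that a real API would be expected to avoid.

```scala
import org.apache.spark.sql.Dataset

object BarrierSyntax {
  // Hypothetical extension, not part of Spark: adds .barrier() to any Dataset.
  implicit class DatasetBarrierOps[T](ds: Dataset[T]) {
    // Approximates the requested barrier with an always-true typed filter.
    def barrier(): Dataset[T] = ds.filter((_: T) => true)
  }
}
```

With `import BarrierSyntax._` in scope, a pipeline could then be written as `source.selectExpr(...).barrier().selectExpr(...)`.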

People

Assignee: Unassigned
Reporter: Devyn Cairns
Votes: 0
Watchers: 1

Dates

Created: 13/Nov/19 14:02
Updated: 25/May/21 01:55
Resolved: 25/May/21 01:43