[SPARK-18748] UDF multiple evaluations causes very poor performance - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.0, 2.4.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

We have a use case with a relatively expensive UDF that needs to be calculated. The problem is that instead of being calculated once, it is calculated over and over again.
For example:

def veryExpensiveCalc(str:String) = {println("blahblah1"); "nothing"}
hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is not null and c<>''").show

with the output:

blahblah1
blahblah1
blahblah1
+-------+
|      c|
+-------+
|nothing|
+-------+

You can see that each reference to column "c" triggers another call to the UDF, and therefore another println.
This causes very poor performance in our real use case.
This has also come up on Stack Overflow:
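One way to check where the extra calls come from is to look at the optimized plan. The following is a minimal sketch, assuming a SparkSession named `spark` rather than the `hiveContext` used above; the object name `InspectUdfPlan` and the helper `optimizedPlan` are hypothetical. If the optimizer has substituted the UDF expression for each reference to "c", the call may appear several times in the printed plan.

```scala
import org.apache.spark.sql.SparkSession

object InspectUdfPlan {
  // Same stand-in UDF body as in the report above.
  def veryExpensiveCalc(str: String): String = { println("blahblah1"); "nothing" }

  // Hypothetical helper: registers the UDF and returns the optimized logical
  // plan as text, so the UDF references in it can be counted by eye.
  def optimizedPlan(spark: SparkSession): String = {
    spark.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
    val df = spark.sql(
      "select * from (select veryExpensiveCalc('a') c) z where c is not null and c <> ''")
    df.queryExecution.optimizedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to reproduce the plan shape.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("inspect-udf-plan")
      .getOrCreate()
    println(optimizedPlan(spark))
    spark.stop()
  }
}
```

Calling `df.explain(true)` instead prints the parsed, analyzed, optimized, and physical plans together, which can help show at which stage the expression gets duplicated.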
http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns
http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/

Two workarounds were suggested there, both problematic:
1. Call cache() after the first evaluation, e.g.:

hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not null and c<>''").show

While this works, we can't use it in our case because the table is too big to cache.

2. Convert to an RDD and back:

val df = hiveContext.sql("select veryExpensiveCalc('a') as c")
hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and c<>''").show

This works, but we lose some optimizations, such as predicate pushdown, and it's very ugly.

Any ideas on how we can make the UDF get calculated just once in a reasonable way?
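One option that is not discussed in this issue is to mark the UDF as nondeterministic, which discourages the optimizer from copying the expression into other operators. The following is a minimal sketch, assuming Spark 2.3 or later (where `UserDefinedFunction.asNondeterministic()` is available) and a SparkSession named `spark` instead of `hiveContext`; the object name `UdfOnce` and the helper `run` are hypothetical. Whether this reduces the calls to one per row should be verified against the version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfOnce {
  // Same stand-in UDF body as in the report above.
  def veryExpensiveCalc(str: String): String = { println("blahblah1"); "nothing" }

  // Hypothetical helper: registers the UDF as nondeterministic, runs the
  // query from the report, and returns the values of column "c".
  def run(spark: SparkSession): Array[String] = {
    // asNondeterministic() tells the optimizer the result may differ between
    // calls, so rules that require deterministic expressions skip it.
    val expensive = udf(veryExpensiveCalc _).asNondeterministic()
    spark.udf.register("veryExpensiveCalc", expensive)
    spark
      .sql("select * from (select veryExpensiveCalc('a') c) z where c is not null and c <> ''")
      .collect()
      .map(_.getString(0))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-once")
      .getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```

The trade-off is similar in kind to the RDD round trip: filters that reference the UDF's output can no longer be pushed below the projection that computes it, although the rest of the query stays in the DataFrame API.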

Issue Links

duplicates

SPARK-17728 UDFs are run too many times

Closed

People

Assignee: Unassigned
Reporter: Ohad Raviv
Votes: 2
Watchers: 10

Dates

Created: 06/Dec/16 19:42
Updated: 25/May/21 01:54
Resolved: 25/May/21 01:43