Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None
    • Environment:

      Databricks Cloud / Spark 2.0.0

      Description

      Background

      Longer-running processes might run analytics or contact external services from UDFs. The response might not be just a single field, but a structure of information. When attempting to break this information out into columns, it is critical that the query is optimized correctly.

      Steps to Reproduce

      1. Create some sample data.
      2. Create a UDF that returns multiple attributes.
      3. Run UDF over some data.
      4. Create new columns from the multiple attributes.
      5. Observe the run time (a rough repro sketch follows).
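
      A minimal sketch of these steps for the spark-shell; fUdf, the field names, and the Parquet path mirror the snippets quoted later in this thread, so treat it as an assumed repro rather than the attached example:

        import org.apache.spark.sql.functions._

        case class Info(plusOne: Int, squared: Int)
        val fUdf = udf((a: Int) => Info(a + 1, a * a)) // stand-in for an expensive pure function

        (1 to 10).toDF("a").write.mode("overwrite").parquet("/tmp/as.parquet")
        val as = spark.read.parquet("/tmp/as.parquet") // a Parquet source triggers the behavior

        val exploded = as
          .withColumn("structured_information", fUdf('a))
          .withColumn("plus_one", 'structured_information("plusOne"))
          .withColumn("squared", 'structured_information("squared"))
        exploded.show() // time this; the UDF fires more than once per row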

      Actual Results

      The UDF is executed multiple times per row.

      Expected Results

      The UDF should only be executed once per row.

      Workaround

      Cache the Dataset after UDF execution.
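
      A minimal sketch of this workaround, reusing fUdf and the column names from the repro sketch above. The cache is materialized on the first action, after which the derived columns read the stored struct instead of re-running the UDF:

        val withStruct = as
          .withColumn("structured_information", fUdf('a))
          .cache() // materialize the UDF output once (on the first action)

        val exploded = withStruct
          .withColumn("plus_one", 'structured_information("plusOne"))
          .withColumn("squared", 'structured_information("squared"))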

      Details

      For code and more details, see over_optimized_udf.html

        Issue Links

          Activity

          hvanhovell Herman van Hovell added a comment -

          You really should not try to use any external state in a UDF (it should be a pure function).

          It might be an idea to use a generator in this case. These are guaranteed to execute only once per input tuple. (A sketch of that idea follows.)
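
          A sketch of the generator idea, using the explode(array(...)) form that appears later in this thread; fUdf and the column name are from the attached example:

            as.select('a, explode(array(fUdf('a))) as "structured_information")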

          jeisinge Jacob Eisinger added a comment -

          I am a little confused.

          1. Could you explain how a generator would apply here?
          2. You mentioned that UDFs should be pure functions. Is Spark optimizing the function calls as if they are pure functions?

          (Also, please check out my example — the UDF there should be a pure function.)

          hvanhovell Herman van Hovell added a comment -

          Spark assumes UDFs are pure functions; we do not guarantee that a function is executed only once. This is due to the way the optimizer works, and the fact that stages are sometimes retried. We could add a flag to UDFs to prevent this from happening, but this would be a considerable engineering effort.

          The example you give is not really a pure function, as its side effect makes the thread stop (it changes state).

          If you are connecting to an external service, then I would suggest using Dataset.mapPartitions(...) (similar to a generator). This will allow you to set up one connection per partition, and you can call a method as much or as little as you like. (A sketch follows.)
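
          A minimal sketch of that suggestion; ExternalService, connect(), and classify() are hypothetical names used only for illustration:

            import org.apache.spark.sql.Dataset

            def enrich(ds: Dataset[String]): Dataset[(String, String)] = {
              import ds.sparkSession.implicits._ // encoder for the tuple result
              ds.mapPartitions { rows =>
                val conn = ExternalService.connect() // hypothetical: one connection per partition
                rows.map(s => (s, conn.classify(s))) // reuse the connection for every row
              }
            }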

          hvanhovell Herman van Hovell added a comment -

          I am going to close this as not a problem, but feel free to follow up.

          jeisinge Jacob Eisinger added a comment -

          Thanks for the explanation, but I still think this is an issue.

          If Spark assumed there were no side effects and optimized accordingly, there would be no issue: the UDF would be called once per row (1). However, Spark calls a costly function many times, leading to inefficiency.

          In our production code, we have a function that takes in a long string and classifies it under a number of different dimensions. This is a very CPU-intensive operation and a pure function. Obviously, if Spark's optimizer calls the function multiple times, this is not optimal.

          I think it is intuitive to most that the following code would call the UDF once per row (1):

          val exploded = as
            .withColumn("structured_information", fUdf('a))
            .withColumn("plus_one", 'structured_information("plusOne"))
            .withColumn("squared", 'structured_information("squared"))
          

          However, Spark calls the UDF three times per row! Is this what you would expect? What am I missing?

          (1) - "Once per row" - except when the row needs to recomputed such as when workers are lost.
          (2) - I attempted to model the long operation via Thread.sleep(); as you mentioned this does have a slight side effect. Maybe I should have summed the first billion counting numbers to illustrate the slow down?
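
          For illustration, footnote (2)'s idea as code: a pure, CPU-bound stand-in for the expensive work instead of Thread.sleep() (slowUdf is a hypothetical name):

            val slowUdf = udf((a: Int) => {
              var s = 0L
              var i = 1L
              while (i <= 1000000000L) { s += i; i += 1 } // pure CPU work, no side effects
              a + 1 + (s % 1).toInt // use s so the loop cannot be optimized away
            })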

          hvanhovell Herman van Hovell added a comment -

          I think calling explain(true) on your plans helps to understand what is going on.

          Spark executes the UDF three times because the optimizer collapses subsequent projects (a project normally being much more expensive than a UDF call). In your case the three projects get rewritten into one project, and the expressions are rewritten into the following form:

          structured_information -> fUdf('a)
          plus_one -> fUdf('a).get("plusOne")
          squared -> fUdf('a).get("squared")
          
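          Calling explain(true) on the collapsed plan should make this visible; roughly along these lines (hand-written for illustration, not copied from a real run):

            exploded.explain(true)
            // == Optimized Logical Plan == (abbreviated)
            // Project [a, UDF(a) AS structured_information,
            //          UDF(a).plusOne AS plus_one, UDF(a).squared AS squared]
            // +- Relation[a] parquet
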

          It is a bit tricky to get around this; the following might work:

          val exploded = as
            .withColumn("structured_information", explode(array(fUdf('a))))
            .withColumn("plus_one", 'structured_information("plusOne"))
            .withColumn("squared", 'structured_information("squared"))
          
          jeisinge Jacob Eisinger added a comment - - edited

          Thanks for the explanation and the tricky code snippet! I kind of figured it was optimizing incorrectly / over-optimizing. It sounds like this is not a defect because, normally, collapsing projects is the desired optimization. Correct?

          Do you think it is worth filing a feature request to allow working with costly UDFs? Possibly:

          • Memoize UDFs / other transforms on a per-row basis (a rough sketch of this idea follows).
          • Manually override costs for UDFs.
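
          A hand-rolled sketch of the memoization idea, keying on the input value rather than the row; Memo and expensive(...) are hypothetical names, and this is a per-executor cache, not a Spark feature:

            import scala.collection.concurrent.TrieMap

            object Memo {
              private val cache = TrieMap.empty[Int, Int] // one cache per executor JVM
              def apply(a: Int): Int = cache.getOrElseUpdate(a, expensive(a)) // expensive(...) is hypothetical
            }
            val memoUdf = udf((a: Int) => Memo(a)) // repeated calls with the same input hit the cache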
          hvanhovell Herman van Hovell added a comment -

          First of all, we implement subexpression elimination (which is a form of memoization), and this should prevent multiple invocations from happening. I am quite curious why this is not triggering in your case. Are you on a completely interpreted path?

          Cost functions for a UDF are doable; we would have to do this for expression trees, though, and that is non-trivial to implement.

          jeisinge Jacob Eisinger added a comment -

          What do you mean by completely interpreted path?

          jeisinge Jacob Eisinger added a comment -

          Also, it is interesting to note that this occurs for Parquet-backed Datasets, and not when generating the Dataset in memory.

          For example,

          val as = spark.read.parquet("/tmp/as.parquet")
          

          triggers the behavior, but

          val as = (1 to 10).toDF("a")
          

          does not.

          hvanhovell Herman van Hovell added a comment -

          There are different evaluation paths in Spark SQL:

          • Interpreted. Expressions are evaluated using an eval(...) method. Plans are evaluated using iterators (volcano model). This is what I mean by the completely interpreted path.
          • Expression Codegenerated. This means that all expressions are evaluated using a code-generated function. Plans are evaluated using iterators.
          • Whole-stage Codegenerated. All expressions and most plans are evaluated using code generation.

          I think you are using whole-stage code generation. This does not support common subexpression elimination. (A quick way to test this is sketched below.)
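
          One way to check, assuming the existing Spark SQL flag spark.sql.codegen.wholeStage: toggle whole-stage codegen off for the session and compare the plans and timings:

            spark.conf.set("spark.sql.codegen.wholeStage", "false")
            exploded.explain() // physical operators should lose their '*' whole-stage marker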

          hvanhovell Herman van Hovell added a comment - - edited

          The second one does not trigger the behavior because it is turned into a LocalRelation; these are evaluated during optimization, before we collapse projections.

          The following in memory dataframe should trigger the same behavior:

          spark.range(1, 10)
               .withColumn("expensive_udf_result", fUdf($"id"))
               .withColumn("b", $"expensive_udf_result" + 100)
          
          jeisinge Jacob Eisinger added a comment -

          Thanks for the great explanation! Are there plans for subexpression elimination in whole-stage code generation, and do you think there should be?

          hvanhovell Herman van Hovell added a comment -

          I think we should eventually add it; however, this is currently quite hard to implement properly. There is also the matter that the JIT is really good at inlining small functions and doing common subexpression elimination for us. In this particular case it makes more sense to me to add costs to UDFs or to add a mechanism that prevents project collapsing.

          jeisinge Jacob Eisinger added a comment -

          Herman van Hovell, thanks for looking into this! Unless you can think of a better way to ensure the UDF doesn't get executed multiple times, we are going to go with your workaround of:

          val exploded = as
            .withColumn("structured_information", explode(array(fUdf('a))))
            .withColumn("plus_one", 'structured_information("plusOne"))
            .withColumn("squared", 'structured_information("squared"))

          exploded.explain
          == Physical Plan ==
          *Project [a#10, structured_information#159, structured_information#159.plusOne AS plus_one#161, structured_information#159.squared AS squared#166]
          +- Generate explode(array(if (isnull(a#10)) null else UDF(a#10))), true, false, [a#10, structured_information#159]
             +- *BatchedScan parquet [a#10] Format: ParquetFormat, InputPaths: file:/tmp/as.parquet, PushedFilters: [], ReadSchema: struct<a:int>
          

          I reckon it might impact GC a bit with the creation of the extra arrays, but that sure beats the cost of running those expensive UDFs! Thanks again for the excellent explanations!


            People

            • Assignee: Unassigned
            • Reporter: jeisinge Jacob Eisinger
            • Votes: 0
            • Watchers: 5

              Dates

              • Created:
              • Updated:
              • Resolved:
