Details
- Bug
- Status: Closed
- Minor
- Resolution: Not A Problem
- 2.0.0
- None
- None
- Databricks Cloud / Spark 2.0.0
Description
Background
Longer-running processes might run analytics or contact external services from UDFs. The response might not be a single field but a structure of information. When breaking this structure out into separate columns, it is critical that the query is optimized correctly.
Steps to Reproduce
- Create some sample data.
- Create a UDF that returns multiple attributes.
- Run the UDF over some data.
- Create new columns from the multiple attributes.
- Observe run time.
Actual Results
The UDF is executed multiple times per row.
Expected Results
The UDF should only be executed once per row.
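The difference between the actual and expected results can be illustrated with a toy model in plain Python (not Spark code). It assumes the repeated execution comes from the UDF expression being re-evaluated for every column derived from its result; the report itself does not state the cause. The function `enrich`, its attribute names, and the sample rows are hypothetical.

```python
# Toy model, not Spark code: counts how often a slow "UDF" runs per row.
calls = 0


def enrich(value):
    """Hypothetical UDF that returns a structure with several attributes."""
    global calls
    calls += 1
    return {"doubled": value * 2, "squared": value * value}


def inlined_plan(rows):
    # Each new column re-evaluates the UDF expression, as if enrich(x) had
    # been substituted for every reference to its result.
    return [(enrich(x)["doubled"], enrich(x)["squared"]) for x in rows]


def materialized_plan(rows):
    # The UDF result is computed once per row, then both attributes are read.
    output = []
    for x in rows:
        result = enrich(x)
        output.append((result["doubled"], result["squared"]))
    return output


def count_calls(plan, rows):
    """Run a plan and report its output together with the number of UDF calls."""
    global calls
    calls = 0
    output = plan(rows)
    return output, calls


rows = [1, 2, 3]
inlined_output, inlined_calls = count_calls(inlined_plan, rows)
materialized_output, materialized_calls = count_calls(materialized_plan, rows)
print(inlined_calls, materialized_calls)  # 6 3
```

Both plans produce the same columns; only the number of UDF calls differs, which is why the problem shows up as run time rather than as wrong output.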
Workaround
Cache the Dataset after UDF execution.
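A minimal sketch of this workaround for the PySpark DataFrame API follows. The helper name `explode_struct_once`, the intermediate column name `udf_result`, and all parameters are hypothetical; it assumes `struct_udf` is a UDF that returns a struct and that `fields` lists the struct's attribute names. It only uses `withColumn`, `cache`, `getField`, and `drop`.

```python
def explode_struct_once(df, struct_udf, input_col, fields):
    """Apply a struct-returning UDF, cache the result, then split out fields.

    Hypothetical helper: `df` is a PySpark DataFrame, `struct_udf` a UDF that
    returns a struct, `input_col` the UDF's input column name, and `fields`
    the struct attribute names to promote to top-level columns.
    """
    # Run the UDF into a single struct column and mark that result for
    # caching, so later field selections read the cached struct.
    enriched = df.withColumn("udf_result", struct_udf(df[input_col])).cache()

    # Break each requested attribute out into its own top-level column.
    for name in fields:
        enriched = enriched.withColumn(name, enriched["udf_result"].getField(name))

    # The intermediate struct column is no longer needed.
    return enriched.drop("udf_result")
```

Caching trades memory or disk for the repeated UDF calls, so it is most attractive when the UDF is expensive relative to the size of its output.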
Details
For code and more details, see over_optimized_udf.html
Attachments
Issue Links
- Blocked
  - SPARK-27245 Optimizer repeat Python UDF calls (Resolved)
- duplicates
  - SPARK-12352 Reuse the result of split in SQL (Closed)
- is duplicated by
  - SPARK-18748 UDF multiple evaluations causes very poor performance (Resolved)
  - SPARK-18747 UDF multiple evaluations causes very poor performance (Closed)