[SPARK-31945] Make more cache enable for Python UDFs. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.1.0
Component/s: PySpark
Labels: None

Description

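Currently, the cache manager does not reuse cached data for a Python UDF if the UDF is created again, even if the underlying function is the same. In the example below, the second query wraps the same func in a new udf, yet its physical plan re-evaluates the UDF over the range instead of reading from the cache.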

>>> from pyspark.sql.functions import udf
>>> func = lambda x: x

>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()

>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
   +- *(1) Range (0, 1, step=1, splits=12)
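
One way to read the plan above: when Spark serves a query from cached data, the physical plan normally contains an InMemoryTableScan (over an InMemoryRelation) node. The plan in this report has neither, which is how the cache miss shows up. The snippet below is only a sketch of that check on the plan text; plan_uses_cache and PLAN_FROM_ISSUE are hypothetical names introduced here for illustration, not part of Spark or of the fix.

```python
# Physical plan text copied from the issue description.
PLAN_FROM_ISSUE = """\
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
   +- *(1) Range (0, 1, step=1, splits=12)
"""


def plan_uses_cache(plan_text: str) -> bool:
    """Hypothetical helper: report whether a plan string mentions a cache scan.

    Assumption: a plan that reads cached data contains an
    InMemoryTableScan / InMemoryRelation node.
    """
    markers = ("InMemoryTableScan", "InMemoryRelation")
    return any(marker in plan_text for marker in markers)


# The plan from the report re-evaluates the UDF, so no cache marker is found.
print(plan_uses_cache(PLAN_FROM_ISSUE))
```

With a live session, the same check could be applied to the text printed by explain() for the second query; once the cache is picked up, the expectation is that a cache scan node appears in place of the BatchEvalPython step.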

Issue Links

links to

[Github] Pull Request #28774 (ueshin)

People

Assignee: Takuya Ueshin

Reporter: Takuya Ueshin

Votes: 0

Watchers: 2

Dates

Created: 10/Jun/20 01:40

Updated: 12/Dec/22 18:10

Resolved: 10/Jun/20 07:39