[SPARK-8632] Poor Python UDF performance because of RDD caching - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 1.5.1, 1.6.0
Component/s: PySpark, SQL
Labels:
None

Target Version/s:

1.5.1, 1.6.0

Description

We have been running into performance problems using Python UDFs with DataFrames at large scale.

From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data. One to give to the PythonRDD, then one to join the python lambda results with the original row (which may have java objects that should be passed through).

In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the cases I was working with, I had a 500 column table, and i wanted to use a python UDF for one column, and it ended up caching all 500 columns.

http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html

Attachments

Issue Links

is duplicated by

SPARK-10685 Misaligned data with RDD.zip and DataFrame.withColumn after repartition

Resolved

SPARK-10494 Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)

Resolved

links to

[Github] Pull Request #8662 (justinuang)

[Github] Pull Request #8833 (davies)

[Github] Pull Request #8835 (rxin)

Activity

People

Assignee:: Davies Liu

Reporter:: Justin Uang

Shepherd:: Davies Liu

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 25/Jun/15 14:46

Updated:: 09/Apr/16 20:37

Resolved:: 22/Sep/15 21:23