[SPARK-21404] Simple Vectorized Python UDFs using Arrow - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None

Description

Using Arrow, Python UDFs can be evaluated in vectorized form by passing the column data as a pandas.Series. This offers a performance gain by computing the return column data in one operation, instead of iterating over each row to calculate a single element and appending it to a list, as is currently done. The existing Python UDF API, which specifies the return type, can be used to implement this. Since not all functions can be vectorized, there would need to be a way to enable this optimization, such as a SQLConf.
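The difference between the two evaluation modes can be illustrated outside of Spark. The following is only a minimal sketch using plain pandas, not the implementation in the linked pull request: `evaluate_udf` and its `vectorized` flag are hypothetical names, with the flag standing in for the opt-in setting (such as a SQLConf) mentioned above.

```python
import pandas as pd


def evaluate_udf(func, column, vectorized=False):
    """Hypothetical helper: apply func to a column of values.

    The `vectorized` flag is an assumption standing in for an opt-in
    setting; it is not a real Spark option.
    """
    if vectorized:
        # One call: the whole column is passed as a pandas.Series and
        # the return column comes back in a single operation.
        return func(pd.Series(column)).tolist()
    # Row-at-a-time: one call per element, each result appended to a list.
    results = []
    for value in column:
        results.append(func(value))
    return results


def plus_one(x):
    # Works for a scalar and for a pandas.Series, so it can be vectorized.
    return x + 1


column = [1, 2, 3, 4]
row_result = evaluate_udf(plus_one, column)
vec_result = evaluate_udf(plus_one, column, vectorized=True)
```

Both calls produce the same column here, but a function that relies on scalar-only operations (for example, branching on the value of `x`) would fail on a Series, which is why the vectorized path would need to be opt-in.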

This is designed as a preliminary step for the existing SPIP: Vectorized UDFs in Python (SPARK-21190), and could be used as a basis for whatever expanded API is decided upon there.

Issue Links

relates to: SPARK-21190 SPIP: Vectorized UDFs in Python (Resolved)

links to: [Github] Pull Request #18659 (BryanCutler)

People

Assignee: Unassigned

Reporter: Bryan Cutler

Votes: 0

Watchers: 4

Dates

Created: 13/Jul/17 17:06

Updated: 07/Oct/17 14:09

Resolved: 22/Sep/17 16:55