[SPARK-30153] Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

Spark introduced vectorized UDFs with pandas_udf, which provide a considerable speedup, where applicable, by reducing serialization and deserialization overhead.
The current implementation of pandas_udf uses Arrow for fast serialization and then Pandas Series (or Pandas DataFrames) for processing.
In certain cases, UDF performance can be improved further by bypassing the conversion to and from Pandas and working on Arrow Tables directly, with the help of specialized libraries that can process Arrow Tables and Arrays.
One such case is scientific computing on high energy physics data, where array processing is of key importance.
In a test case, a UDF based on plain Arrow serialization, implemented with a custom-developed extension of pandas_udf, ran about 3x faster than the equivalent processing with pandas_udf. The Arrow data in the test case was processed with the "awkward arrays" library (https://github.com/scikit-hep/awkward-array).

Attachments

- Flamegraph_test_pandas_udf_SCALAR_ARROW.png (83 kB), attached 06/Dec/19 16:54 by Luca Canali
- Flamegraph_test_pandas_udf_SCALAR.png (113 kB), attached 06/Dec/19 16:54 by Luca Canali

Issue Links

is superseded by

SPARK-37227 DataFrame.mapInArrow

Open

links to

GitHub Pull Request #26783

People

Assignee: Unassigned
Reporter: Luca Canali
Votes: 1
Watchers: 6

Dates

Created: 06/Dec/19 16:11
Updated: 12/Dec/22 18:10
Resolved: 07/Nov/21 05:11