[SPARK-22216] Improving PySpark/Pandas interoperability - ASF JIRA

Details

Type: Epic
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: PySpark
Labels:
- bulk-closed

Description

This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the data-exchange overhead between PySpark and Pandas.

Issue Links

incorporates: SPARK-21187 Complete support for remaining Spark data types in Arrow Converters (Resolved)

Sub-Tasks

1.	groupBy().apply() with pandas udf in pyspark	Resolved	Li Jin
2.	SPIP: Vectorized UDFs in Python	Resolved	Bryan Cutler
3.	Simple Vectorized Python UDFs using Arrow	Closed	Unassigned
4.	Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame	Resolved	Bryan Cutler
5.	User-defined window functions with pandas udf (unbounded window)	Resolved	Li Jin
6.	User-defined aggregation functions with pandas udf	Resolved	Li Jin
7.	Design doc for different types of pandas_udf	Resolved	Unassigned
8.	Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17	Resolved	Bryan Cutler
9.	Add function type argument to pandas_udf	Resolved	Li Jin
10.	Improve the description of Vectorized UDFs for non-deterministic cases	Resolved	Li Jin
11.	Register Scalar Vectorized UDFs for SQL Statement	Resolved	Xiao Li
12.	Using pandas_udf when inputs are not Pandas's Series or DataFrame	Resolved	Hyukjin Kwon
13.	Support alternative function form with group aggregate pandas UDF	Resolved	Li Jin
14.	Decrease memory consumption with toPandas() collection using Arrow	Resolved	Bryan Cutler
15.	Change MapVector to NullableMapVector in ArrowColumnVector	Resolved	Li Jin
16.	Rename Pandas UDFs	Resolved	Xiao Li
17.	Refactor group aggregate pandas UDF to its own catalyst rules	Resolved	Unassigned
18.	Pandas grouped udf on dataset with timestamp column error	Resolved	Li Jin
19.	Explicitly specify supported types in Pandas UDFs	Resolved	Hyukjin Kwon
20.	Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame	Resolved	Hyukjin Kwon
21.	Improve test cases for all supported types and unsupported types	Resolved	Aleksandr Koriagin
22.	Explicitly check supported types in toPandas	Resolved	Hyukjin Kwon
23.	Update Pandas UDFs section in sql-programming-guide	Resolved	Li Jin
24.	Support partial function and callable object with pandas UDF	Resolved	Unassigned
25.	Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator	Resolved	Li Jin
26.	Pandas Grouped Map UserDefinedFunction mixes column labels	Resolved	Bryan Cutler
27.	User-defined window functions with pandas udf (bounded window)	Resolved	Li Jin
28.	Support GROUPED_AGG_PANDAS_UDF in Pivot	Resolved	Unassigned
29.	Can not mix vectorized and non-vectorized UDFs	Resolved	Li Jin
30.	Fix pandas_udf with return type StringType() to handle str type properly in Python 2.	Resolved	Takuya Ueshin
31.	Allow None for Decimal type conversion (specific to PyArrow 0.9.0)	Resolved	Hyukjin Kwon
32.	Show some kind of test output to indicate pyarrow tests were run	Resolved	Bryan Cutler
33.	Improve toPandas with Arrow by sending out-of-order record batches	Resolved	Bryan Cutler
34.	Add an example for having two columns as the grouping key in group aggregate pandas UDF	Resolved	Hyukjin Kwon
35.	Register Grouped aggregate UDF Vectorized UDFs for SQL Statement	Resolved	Hyukjin Kwon
36.	Clarify/Improve EvalType for grouped aggregate and window aggregate	Resolved	Unassigned
37.	Internally document type conversion between Pandas data and SQL types in Pandas UDFs	Resolved	Hyukjin Kwon
38.	Update document type conversion for Pandas UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7)	Resolved	Hyukjin Kwon

People

Assignee: Li Jin
Reporter: Li Jin
Votes: 0
Watchers: 31

Dates

Created: 06/Oct/17 15:20
Updated: 12/Dec/22 18:11
Resolved: 25/May/21 01:45