[SPARK-43797] Python User-defined Table Functions - ASF JIRA

Details

Type: Umbrella
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0, 4.0.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

This is an umbrella ticket to support Python user-defined table functions.
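
For readers new to the feature, here is a minimal sketch of the handler-class shape that a Python UDTF takes: a class whose `eval` method yields one tuple per output row. The class name `SquareNumbers`, the column names, and the `run_locally` helper are hypothetical and exist only for illustration; the helper is not part of PySpark and merely drives the handler without a Spark session. The commented lines at the end assume PySpark 3.5 or later is installed.

```python
from typing import Iterator, Tuple


class SquareNumbers:
    """Hypothetical handler class in the shape a Python UDTF takes."""

    def eval(self, start: int, end: int) -> Iterator[Tuple[int, int]]:
        # Called once per input row; every yielded tuple is one output row.
        for num in range(start, end + 1):
            yield (num, num * num)


def run_locally(handler, *args):
    # Hypothetical helper, not a PySpark API: collects the rows that
    # eval() produces for a single call so the logic can be checked
    # without starting Spark.
    return list(handler.eval(*args))


rows = run_locally(SquareNumbers(), 1, 3)
print(rows)

# Assumption: with PySpark 3.5+ available, the same class could be
# wrapped and invoked roughly like this.
# from pyspark.sql.functions import lit, udtf
# SquareNumbersUDTF = udtf(SquareNumbers, returnType="num: int, squared: int")
# SquareNumbersUDTF(lit(1), lit(3)).show()
```

Keeping the handler a plain class means the row-generation logic can be unit tested on its own, while registration with Spark stays a separate step.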

Issue Links

relates to: SPARK-39247 Support returning a table or set of rows in CREATE FUNCTION (Open)

Sub-Tasks

#	Summary	Status	Assignee
1.	Initial support for Python UDTFs	Resolved	Allison Wang
2.	Support arrow-optimized Python UDTFs	Resolved	Allison Wang
3.	Support Python UDTFs in Spark Connect	Resolved	Allison Wang
4.	Support non-deterministic Python UDTFs	Resolved	Allison Wang
5.	Support Python UDTFs with empty return values	Resolved	Allison Wang
6.	Improve error messages for Python UDTFs with wrong number of outputs	Resolved	Allison Wang
7.	Improve error messages for Python UDTF arrow type casts	Resolved	Allison Wang
8.	Improve error messages for Python UDTF returning non iterable	Resolved	Allison Wang
9.	Fix AssertionError when converting UDTF output to a complex type	Resolved	Takuya Ueshin
10.	Improve error messages for creating Python UDTFs with pickling errors	Resolved	Allison Wang
11.	Disable arrow optimization by default for Python UDTFs	Resolved	Allison Wang
12.	Improve error messages for regular Python UDTFs that return non-tuple values	Resolved	Allison Wang
13.	Include the name of the UDTF in the error messages generated during the function execution	Open	Unassigned
14.	Support profiler for Python UDTFs	Open	Unassigned
15.	Refactor PythonUDTFRunner to send its return type separately	Resolved	Takuya Ueshin
16.	Support for UDTF to analyze in Python	Resolved	Takuya Ueshin
17.	Support Python UDTFs with empty schema	Resolved	Takuya Ueshin
18.	Query planning to support PARTITION BY and ORDER BY clause for table arguments	Resolved	Daniel
19.	Add support for accumulator, broadcast, and Spark files in Python UDTF's analyze.	Resolved	Takuya Ueshin
20.	Set up memory limits for analyze in Python.	Resolved	Takuya Ueshin
21.	Add user guide for Python UDTFs	Resolved	Allison Wang
22.	Improve the documentation for TABLE input arguments for UDTFs	Resolved	Daniel
23.	Query execution to support PARTITION BY and ORDER BY clause for table arguments	Resolved	Daniel
24.	Support named arguments in Python UDTF	Resolved	Takuya Ueshin
25.	Cache the pandas converter for Python UDTFs	Resolved	Allison Wang
26.	Make Python UDTFs by default non-deterministic	Resolved	Allison Wang
27.	Add SQL query test suites for Python UDTFs	Resolved	Allison Wang
28.	Refactor Arrow Python UDTF	Resolved	Takuya Ueshin
29.	Improve Python UDTF arrow serializer performance	Open	Michael Zhang
30.	Add API in 'analyze' method to return partitioning/ordering expressions	Resolved	Daniel
31.	Project out PARTITION BY expressions before 'eval' method consumes input rows	Resolved	Daniel
32.	Add a new method `cleanup` in the UDTF interface	Resolved	Allison Wang
33.	Add API for 'analyze' method to return a buffer to be consumed on each class creation	Resolved	Daniel
34.	Refactor analyzeInPython function to make it reusable	Resolved	Allison Wang
35.	Return useful error message if UDTF returns None for non-nullable column	Resolved	Daniel
36.	Return specific error messages if UDTF 'analyze' method accepts or returns wrong values	Resolved	Daniel
37.	Create API to stop consuming rows from the input table	Resolved	Daniel
38.	Update API for 'analyze' partitioning/ordering columns to support general expressions	Resolved	Daniel
39.	Create API to acquire execution memory for 'eval' and 'terminate' methods	Closed	Unassigned
40.	Create API for 'analyze' method to indicate subset of input table columns to select	Resolved	Daniel
41.	Enforce that 'AnalyzeResult' 'orderBy' field is a list of pyspark.sql.functions.OrderingColumn	Resolved	Daniel
42.	Create API for 'analyze' method to send input column(s) to output table unchanged	Resolved	Unassigned
43.	Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments	Resolved	Daniel
44.	Support running Python UDTF 'analyze' method from Spark executors	Resolved	Unassigned
45.	Analyzer bug with multiple ORDER BY items for input table argument	Resolved	Daniel
46.	[Bug] Partition indices are incorrect when UDTF analyze() uses both select and partitionColumns	Resolved	Daniel

People

Assignee: Unassigned

Reporter: Allison Wang

Votes: 0

Watchers: 1

Dates

Created: 25/May/23 17:40

Updated: 08/Jun/24 17:54