[SPARK-44076] SPIP: Python Data Source API - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.0.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

This proposal introduces a simple Python API for data sources. It lets Python developers create data sources without learning Scala or dealing with the complexity of the existing data source APIs. The goal is an API that is simple and easy to use, making Spark more accessible to the wider Python developer community. The proposed approach builds on the recently introduced Python user-defined table functions (SPARK-43797), extended to support data sources.

SPIP: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
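To make the idea concrete, here is a minimal sketch of what a read-only Python data source might look like. It assumes the `DataSource`, `DataSourceReader`, and `InputPartition` base classes that shipped in `pyspark.sql.datasource` with Spark 4.0, and falls back to small stand-in classes when that module is unavailable. The source name `simple_range`, the options `n` and `partitions`, and the class names are hypothetical, chosen only for illustration. The last lines imitate in plain Python what Spark would do: plan the partitions once, then read each one.

```python
# Prefer the real base classes; otherwise use minimal stand-ins (an assumption
# of this sketch) so it also runs where PySpark 4.0 is not installed.
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
except ImportError:
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


class RangeDataSource(DataSource):
    """Hypothetical source producing rows (id, squared) for id in [0, n)."""

    @classmethod
    def name(cls):
        # Short name a user would pass to spark.read.format(...)
        return "simple_range"

    def schema(self):
        # Schema as a DDL string
        return "id int, squared int"

    def reader(self, schema):
        return RangeReader(self.options)


class RangeReader(DataSourceReader):
    def __init__(self, options):
        # Options arrive as strings; both option names are hypothetical
        self.n = int(options.get("n", "6"))
        self.num_parts = max(1, int(options.get("partitions", "2")))

    def partitions(self):
        # Split [0, n) into contiguous chunks, one InputPartition per chunk
        step = max(1, -(-self.n // self.num_parts))  # ceiling division
        return [
            InputPartition((start, min(start + step, self.n)))
            for start in range(0, self.n, step)
        ]

    def read(self, partition):
        # Called once per partition; yields one tuple per row
        start, end = partition.value
        for i in range(start, end):
            yield (i, i * i)


# Plain-Python imitation of the planning and read steps
source = RangeDataSource({"n": "5", "partitions": "2"})
reader = source.reader(source.schema())
rows = [row for part in reader.partitions() for row in reader.read(part)]
```

With a running Spark 4.0 session, the same class would presumably be registered with `spark.dataSource.register(RangeDataSource)` and read with `spark.read.format("simple_range").option("n", "5").load()`, which corresponds to the registration and `DataFrameReader` sub-tasks of this issue.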

Sub-Tasks

1.	Initial support for Python data source read API	Resolved	Allison Wang
2.	Support registering Python data sources	Resolved	Allison Wang
3.	Support loading Python data sources in DataFrameReader	Resolved	Allison Wang
4.	Add InputPartition to DataSourceReader interface	Resolved	Allison Wang
5.	Add Python data source write API	Resolved	Allison Wang
6.	Make Python data source registration session level	Resolved	Allison Wang
7.	Plan Python data source read using mapInArrow	Resolved	Allison Wang
8.	Change saveMode to overwrite for DataSourceWriter constructor	Resolved	Allison Wang
9.	Support spark.read.schema(...) for Python data source API	Resolved	Unassigned
10.	Respect column names when Python data source read function outputs named Row objects	Resolved	Allison Wang
11.	Initial support for Python data source write API	Resolved	Allison Wang
12.	Support spark.read.load() with non-empty path for Python data source API	Open	Unassigned
13.	Support creating table using a Python data source in SQL	Resolved	Hyukjin Kwon
14.	Support `commit` and `abort` API for Python data source write	Resolved	Allison Wang
15.	Support overwrite mode for Python data source write	Resolved	Allison Wang
16.	Investigate runtime registration and feasibility of overwriting the datasource	Resolved	Unassigned
17.	Statically register Python Data Source	Resolved	Hyukjin Kwon
18.	Update `path` handling in Python data source	Resolved	Allison Wang
19.	Allow non-deterministic Python UDFs in MapInPandas/MapInArrow	Resolved	Allison Wang
20.	Support create table using DSv2 sources	Resolved	Allison Wang
21.	Support CTAS using DSv2 sources	Resolved	Allison Wang
22.	Support INSERT INTO/OVERWRITE using DSv2 sources	Resolved	Allison Wang
23.	Add documentation for Python data source API	Resolved	Allison Wang
24.	Refactor Python Data Source instance loading	Resolved	Hyukjin Kwon
25.	Support PythonSQLMetrics.pythonMetrics	Resolved	Hyukjin Kwon
26.	Add a new API in DSv2 DataWriter to write an iterator of records	Resolved	Allison Wang
27.	Block Python data source registration with name conflicts	Resolved	Allison Wang
28.	Improve error messages for invalid save mode	Resolved	Allison Wang
29.	Check Python executable when looking up available Data Sources	Resolved	Hyukjin Kwon
30.	Improve Python data source error classes and messages	Resolved	Allison Wang
31.	Python data source options should be a case insensitive dictionary	Resolved	Allison Wang
32.	Improve error messages for unsupported data source save mode	Resolved	Allison Wang
33.	Log full exception when failed to lookup Python Data Sources	Resolved	Hyukjin Kwon
34.	Disallow re-registration of statically registered data sources	Open	Unassigned
35.	Improve error messages for DATA_SOURCE_NOT_FOUND error	Resolved	Allison Wang
36.	Make DataSourceManager isolated and self clone-able	Resolved	Hyukjin Kwon
37.	Refactor Python Data Source to align with other built-in Data Sources	Resolved	Hyukjin Kwon
38.	Skip test_datasource if PyArrow is not installed	Resolved	Hyukjin Kwon
39.	Skip V2 table lookup when a table is in V1 table cache	Resolved	Allison Wang
40.	Make daemon mode configurable when creating Python workers	Resolved	Allison Wang
41.	Support Python data source API with Spark Connect	Resolved	Allison Wang
42.	Fix docstring links and type hints in Python Data Source	Resolved	Hyukjin Kwon
43.	Document Python Data Source API in API reference page	Resolved	Hyukjin Kwon
44.	Remove the private[sql] modifier for Python data sources	Resolved	Allison Wang
45.	Add user guide for batch data source write API	Resolved	Allison Wang
46.	Refine Python data source API docstring and type hints	Resolved	Allison Wang
47.	Fix Python data source error class references	Resolved	Allison Wang
48.	Add a simple data source example in the user guide	Resolved	Allison Wang
49.	Make static import Python data source configurable	Open	Unassigned
50.	Avoid static Python data source lookup when using builtin or Java data sources	Resolved	Allison Wang
51.	Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance	Resolved	Luca Canali
52.	Support Arrow-Based Python Data Source Writer	Resolved	Allison Wang
53.	Avoid wrapping Python data source error messages thrown during planning	Resolved	Allison Wang
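Several of the sub-tasks above concern the write path (the write API, `commit` and `abort`, and overwrite mode). The following is a rough sketch of how those pieces might fit together, assuming the `DataSourceWriter` and `WriterCommitMessage` base classes from `pyspark.sql.datasource` in Spark 4.0, with stand-ins used when the module is missing. `CountingWriter` and `CountMessage` are hypothetical names. The writer only counts rows, and the final lines imitate Spark by calling `write` once per partition and then `commit` with the collected messages.

```python
from dataclasses import dataclass

# Prefer the real base classes; otherwise use empty stand-ins (an assumption
# of this sketch) so it also runs where PySpark 4.0 is not installed.
try:
    from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage
except ImportError:
    class DataSourceWriter:
        pass

    class WriterCommitMessage:
        pass


@dataclass
class CountMessage(WriterCommitMessage):
    """Hypothetical commit message: number of rows written by one task."""
    count: int


class CountingWriter(DataSourceWriter):
    def __init__(self, overwrite=False):
        # The overwrite flag mirrors the save mode handed to the writer
        self.overwrite = overwrite
        # Stored on the instance for illustration only; in Spark, write()
        # runs on executors while commit() and abort() run on the driver
        self.total_rows = None
        self.aborted = False

    def write(self, iterator):
        # Called once per partition with an iterator of rows
        return CountMessage(count=sum(1 for _ in iterator))

    def commit(self, messages):
        # Called once after every partition has been written successfully
        self.total_rows = sum(m.count for m in messages)

    def abort(self, messages):
        # Called instead of commit() if any write task fails
        self.aborted = True


# Plain-Python imitation of a two-partition write followed by a commit
writer = CountingWriter(overwrite=True)
messages = [writer.write(iter([(0, 0), (1, 1)])), writer.write(iter([(2, 4)]))]
writer.commit(messages)
```

A real writer would persist the rows inside `write` and use `commit` and `abort` to finalize or clean up the output; the counting here only marks where those steps go.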

People

Assignee: Unassigned

Reporter: Allison Wang

Shepherd: Hyukjin Kwon

Votes: 0

Watchers: 2

Dates

Created: 16/Jun/23 04:05

Updated: 10/Oct/23 18:20