[SPARK-15689] Data source API v2 - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels:
- SPIP
- releasenotes

Target Version/s: 2.3.0

Description

This ticket tracks progress on creating v2 of the data source API. The new API should:

1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers.

3. Still support filter push down, similar to the existing API.

4. Nice-to-have: support additional common operators, including limit and sampling.

Note that both 1 and 2 address problems that the current data source API (v1) suffers from. The current API has a wide surface that depends on DataFrame/SQLContext, which ties data source API compatibility to the upper-level API. It is also row-oriented only and has to go through an expensive conversion from external data types to internal data types.
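The goals above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the interfaces proposed in the SPIP: every name here (DataSourceSketch, ColumnBatch, Reader, FilterPushDown, InMemoryReader, scan, fromRows) is hypothetical, and the batch is fixed to two int columns to keep the example short. It shows a small core read interface, an optional filter push-down mix-in that extends it, and a convenience method that turns row-oriented data into a column batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative sketch only: every name here is hypothetical, not Spark's API.
final class DataSourceSketch {

    /** A two-column batch stored as primitive arrays (one array per column). */
    static final class ColumnBatch {
        final int[] ids;
        final int[] values;

        ColumnBatch(int[] ids, int[] values) {
            this.ids = ids;
            this.values = values;
        }

        int numRows() {
            return ids.length;
        }
    }

    /** Core read interface, kept deliberately small (goal 1). */
    interface Reader {
        ColumnBatch readBatch();
    }

    /** Optional mix-in that extends the interface it mixes into (goal 3). */
    interface FilterPushDown extends Reader {
        /** Returns true if the source will apply the filter itself. */
        boolean pushFilter(IntPredicate valueFilter);
    }

    /** Convenience conversion from {id, value} rows to a column batch (goal 2). */
    static ColumnBatch fromRows(List<int[]> rows) {
        int[] ids = new int[rows.size()];
        int[] values = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            ids[i] = rows.get(i)[0];
            values[i] = rows.get(i)[1];
        }
        return new ColumnBatch(ids, values);
    }

    /** A row-oriented in-memory source that accepts pushed-down filters. */
    static final class InMemoryReader implements FilterPushDown {
        private final List<int[]> rows;
        private IntPredicate filter = v -> true; // no filter until one is pushed

        InMemoryReader(List<int[]> rows) {
            this.rows = rows;
        }

        @Override
        public boolean pushFilter(IntPredicate valueFilter) {
            this.filter = valueFilter;
            return true;
        }

        @Override
        public ColumnBatch readBatch() {
            // Apply the pushed filter on the value column before building the batch.
            List<int[]> kept = new ArrayList<>();
            for (int[] row : rows) {
                if (filter.test(row[1])) {
                    kept.add(row);
                }
            }
            return fromRows(kept);
        }
    }

    /** Engine side: push the filter when supported, otherwise filter after the scan. */
    static ColumnBatch scan(Reader reader, IntPredicate valueFilter) {
        if (reader instanceof FilterPushDown
                && ((FilterPushDown) reader).pushFilter(valueFilter)) {
            return reader.readBatch();
        }
        // Fallback for sources without the mix-in: read everything, then filter.
        ColumnBatch all = reader.readBatch();
        List<int[]> kept = new ArrayList<>();
        for (int i = 0; i < all.numRows(); i++) {
            if (valueFilter.test(all.values[i])) {
                kept.add(new int[] {all.ids[i], all.values[i]});
            }
        }
        return fromRows(kept);
    }
}
```

In this sketch the engine depends only on Reader and discovers optional capabilities through mix-ins, so a source that cannot push filters still works unchanged and the core interface never has to grow.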

Attachments

SPIP Data Source API V2.pdf	17/Aug/17 12:29	98 kB	Wenchen Fan

Issue Links

is related to

SPARK-15687 Columnar execution engine	Closed

relates to

SPARK-19351 Support for obtaining file splits from underlying InputFormat	Resolved

links to

[Github] Pull Request #19136 (cloud-fan)

SPIP: Data Source API V2

Sub-Tasks

1.	data source v2 write path	Resolved	Wenchen Fan
2.	clarify exception behaviors for all data source v2 interfaces	Resolved	Wenchen Fan
3.	push down operators to data source before planning	Resolved	Wenchen Fan
4.	mix-in interface should extend the interface it aimed to mix in	Resolved	Wenchen Fan
5.	remove V2 from the class name of data source reader/writer	Resolved	Wenchen Fan
6.	propagate session configs to data source read/write options	Resolved	Xingbo Jiang
7.	partitioning reporting	Resolved	Wenchen Fan
8.	columnar reader interface	Resolved	Wenchen Fan
9.	rename some APIs and classes to make their meaning clearer	Resolved	Zhenhua Wang
10.	DataSourceV2Options should have getInt, getBoolean, etc.	Resolved	Sunitha Kambhampati
11.	Rename ReadTask to DataReaderFactory	Resolved	Gengliang Wang

People

Assignee: Wenchen Fan

Reporter: Reynold Xin

Shepherd: Reynold Xin

Votes: 2

Watchers: 88

Dates

Created: 01/Jun/16 05:32

Updated: 09/Oct/19 15:35

Resolved: 23/Jan/18 15:26