[SPARK-27453] DataFrameWriter.partitionBy is Silently Dropped by DSV1 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.4.1
Fix Version/s: 2.4.2, 3.0.0
Component/s: SQL
Labels: None
Target Version/s: 2.4.2, 3.0.0

Description

This is a long-standing quirk of the interaction between DataFrameWriter and CreatableRelationProvider (and the other forms of the DSV1 API). Users can specify columns in partitionBy, and our internal data sources will use this information. Unfortunately, for external data sources, this information is silently dropped with no feedback given to the user.

In the long run, I think that DataSourceV2 is a better answer. However, I don't think we should wait for that API to stabilize before offering some kind of solution to developers of external data sources. I also do not think we should break binary compatibility of this API, but I do think that a small surgical fix could alleviate the issue.

I propose that we propagate partitioning information (when present) along with the other configuration options passed to the data source in the String-to-String options map.

I think it's very unlikely that there are both data sources that validate extra options and users who are using (no-op) partitioning with them, but out of an abundance of caution we should protect the behavior change behind a legacy flag that can be turned off.
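To make the proposal concrete, here is a minimal, dependency-free sketch of the idea in Python. It is not Spark's implementation. The key name `__partition_columns`, the JSON encoding, and the helpers `with_partitioning` and `read_partitioning` are assumptions made for illustration only, and the `pass_partitioning` argument stands in for the legacy flag described above.

```python
import json
from typing import Dict, List, Optional, Sequence

# Hypothetical option key; the issue text does not say which key to use.
PARTITION_COLUMNS_KEY = "__partition_columns"


def with_partitioning(
    options: Dict[str, str],
    partition_columns: Optional[Sequence[str]],
    pass_partitioning: bool = True,
) -> Dict[str, str]:
    """Return a copy of `options` with the partition columns added, when present.

    `pass_partitioning` plays the role of the legacy flag: turning it off
    restores the old behavior, where the columns never reach the source.
    """
    merged = dict(options)
    if pass_partitioning and partition_columns:
        # The options map is string-to-string, so the column list has to be
        # serialized; JSON is one possible encoding (an assumption here).
        merged[PARTITION_COLUMNS_KEY] = json.dumps(list(partition_columns))
    return merged


def read_partitioning(options: Dict[str, str]) -> List[str]:
    """What an external data source could do to recover the columns."""
    return json.loads(options.get(PARTITION_COLUMNS_KEY, "[]"))
```

A writer would call `with_partitioning(user_options, partition_columns)` before handing the map to the external source. A source that knows about the key reads it back with `read_partitioning`, while a source that ignores unknown options is unaffected. A source that strictly validates its options is the case the flag is meant to protect.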

Issue Links

links to

[Github] Pull Request #24784 (liwensun)

GitHub Pull Request #24365

People

Assignee: Liwen Sun

Reporter: Michael Armbrust

Shepherd: Michael Armbrust

Votes: 0

Watchers: 5

Dates

Created: 12/Apr/19 21:45

Updated: 03/Jun/19 21:45

Resolved: 16/Apr/19 22:39