[SPARK-15982] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: SQL
Labels: None
Target Version/s: 2.0.0

Description

Issues with the current reader behavior:

`text()` without args returns an empty DF with no columns. This is inconsistent: `text()` is expected to always return a DF with a `value` string field.
`textFile()` without args fails with an exception for the same reason: it expects the DF returned by `text()` to have a `value` field.
`orc()` does not accept varargs, which is inconsistent with the other formats.
`json(single-arg)` was removed, which caused source compatibility issues (~~SPARK-16009~~).
A user-specified schema was not respected when text/csv/... were used with no args (~~SPARK-16007~~).

The solution I am implementing is the following.
1. For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, and text, this means adding json(string), etc. For orc, this means adding orc(varargs).
2. Remove the special handling of text(), csv(), etc. that returns an empty DataFrame with no fields. Instead, pass the empty sequence of paths on to the data source and let each data source handle it correctly. For example, the text data source should return an empty DF with schema (value: string).
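
The two points above can be sketched with a small toy model. This is not Spark code: `Frame`, `DataSource`, `TextSource`, `CsvSource`, and `Reader` are hypothetical stand-ins, and the CSV fallback rule is an assumption made only for illustration. The sketch shows the single-argument method delegating to the vararg method, and the reader forwarding an empty path list to the data source instead of special-casing it.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Spark class.
object ReaderSketch {
  type Schema = Seq[(String, String)] // (field name, type name)

  // A "frame" here just records its schema and the paths it was built from.
  final case class Frame(schema: Schema, paths: Seq[String])

  trait DataSource {
    // Each source decides for itself what an empty path list means.
    def load(paths: Seq[String], userSchema: Option[Schema]): Frame
  }

  object TextSource extends DataSource {
    // The text source always exposes a single string column named "value".
    def load(paths: Seq[String], userSchema: Option[Schema]): Frame =
      Frame(Seq("value" -> "string"), paths)
  }

  object CsvSource extends DataSource {
    // Assumed toy rule: use the user schema when one is given,
    // otherwise fall back to an empty schema.
    def load(paths: Seq[String], userSchema: Option[Schema]): Frame =
      Frame(userSchema.getOrElse(Seq.empty), paths)
  }

  final class Reader(userSchema: Option[Schema] = None) {
    def schema(fields: (String, String)*): Reader = new Reader(Some(fields))

    // Single-argument method, kept for source compatibility.
    def text(path: String): Frame = text(Seq(path): _*)
    // Vararg method: an empty argument list is forwarded unchanged.
    def text(paths: String*): Frame = TextSource.load(paths, userSchema)

    def csv(path: String): Frame = csv(Seq(path): _*)
    def csv(paths: String*): Frame = CsvSource.load(paths, userSchema)
  }

  def main(args: Array[String]): Unit = {
    val reader = new Reader()
    println(reader.text().schema)                       // no paths, schema still present
    println(reader.text("a.txt", "b.txt").paths)        // vararg call
    println(reader.schema("id" -> "int").csv().schema)  // user schema kept with no paths
  }
}
```

Because the reader contains no empty-path branch, any consistency guarantee (such as the `value` column for text) lives in one place, the data source itself.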

Issue Links

relates to

SPARK-16009 DataFrameRead.json(path) compatibility broken with Spark 1.6

Resolved

SPARK-16007 Empty DataFrame created with spark.read.csv() does not respect user specified schema

Resolved

links to

[Github] Pull Request #13727 (tdas)

People

Assignee: Tathagata Das
Reporter: Tathagata Das
Votes: 1
Watchers: 4

Dates

Created: 16/Jun/16 05:01
Updated: 20/Jun/16 21:53
Resolved: 20/Jun/16 21:53