[SPARK-19919] Defer input path validation into DataSource in CSV datasource - ASF JIRA

Currently, when the other datasources fail to infer the schema, they return None, and this is then validated in DataSource as shown below:

scala> spark.read.json("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;

scala> spark.read.orc("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;

scala> spark.read.parquet("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

However, CSV performs this check within its own datasource implementation and throws a different exception with a different message, as shown below:

scala> spark.read.csv("emptydir")
java.lang.IllegalArgumentException: requirement failed: Cannot infer schema from an empty set of files

We could remove this duplicated check from CSV and validate the input in one place, in DataSource, so that CSV fails in the same way and with the same message as the other datasources.
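
The idea can be sketched roughly as follows. This is a minimal, Spark-free model of the proposal, not the actual change in the pull request: `FormatSketch`, `CsvSketch`, `DataSourceSketch`, `SchemaInferenceException` and the `_c0` column name are all hypothetical names introduced for illustration, and the schema is simplified to a list of column names.

```scala
// Hypothetical model of the proposal; none of these names are Spark's real API.
final case class SchemaInferenceException(message: String) extends Exception(message)

trait FormatSketch {
  def name: String
  // Returns None when no schema can be inferred, instead of throwing.
  def inferSchema(files: Seq[String]): Option[Seq[String]]
}

object CsvSketch extends FormatSketch {
  val name = "CSV"
  // No format-specific require(...) here: an empty set of files simply yields None.
  def inferSchema(files: Seq[String]): Option[Seq[String]] =
    if (files.isEmpty) None else Some(Seq("_c0"))
}

object DataSourceSketch {
  // The single place where a missing schema is turned into an error,
  // so every format reports the failure with the same message shape.
  def resolveSchema(format: FormatSketch, files: Seq[String]): Seq[String] =
    format.inferSchema(files).getOrElse {
      throw SchemaInferenceException(
        s"Unable to infer schema for ${format.name}. It must be specified manually.")
    }
}
```

With this shape, calling `DataSourceSketch.resolveSchema(CsvSketch, Seq.empty)` fails in the shared validation step rather than inside the CSV-specific code.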

links to

[Github] Pull Request #17256 (HyukjinKwon)