[SPARK-32810] CSV/JSON data sources should avoid globbing paths when inferring schema - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.6, 3.0.0, 3.0.1, 3.1.0
Fix Version/s: 2.4.7, 3.0.2, 3.1.0
Component/s: SQL
Labels: None

Description

When the user does not specify a schema while reading a CSV table, the CSV file format and data source must infer it. They do so by creating a base DataSource relation, and that is where a mismatch arises: FileFormat.inferSchema expects actual file paths without glob patterns, but DataSource.paths expects paths written as glob patterns.
The call stack below (read from bottom to top) shows an example:

^
|         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
|         DataSource.apply                      ^
|       CSVDataSource.inferSchema               |
|     CSVFileFormat.inferSchema                 |
|   ...                                         |
|   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
|   DataSource.apply                            ^
| DataFrameReader.load                          |
|                                       input """\[abc\].csv"""

The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
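
The double-globbing effect can be reproduced outside Spark. The sketch below is only an analogy: it uses Python's standard `glob` module as a stand-in for the path globbing that Spark performs, and the helper names `resolve` and `demo` are hypothetical. Python escapes brackets as `[[]` rather than with backslashes, so the escaped input differs in spelling from the `\[abc\].csv` shown above, but the failure mode is the same: a path that has already been resolved is interpreted as a pattern a second time.

```python
import glob
import os
import tempfile


def resolve(directory, pattern):
    """Expand a glob pattern inside `directory` and return sorted file names."""
    # Escape the directory part so only `pattern` is interpreted as a glob.
    full_pattern = os.path.join(glob.escape(directory), pattern)
    return sorted(os.path.basename(p) for p in glob.glob(full_pattern))


def demo():
    with tempfile.TemporaryDirectory() as d:
        # One file whose name contains glob metacharacters, plus three
        # files that the unescaped pattern "[abc].csv" would match.
        for name in ("[abc].csv", "a.csv", "b.csv", "c.csv"):
            with open(os.path.join(d, name), "w") as f:
                f.write("x,y\n1,2\n")

        # The user escapes the brackets to name the literal file.
        user_input = glob.escape("[abc].csv")
        first_pass = resolve(d, user_input)

        # Problem: the resolved name is globbed again, so "[abc].csv"
        # is now read as a character class and matches the wrong files.
        second_pass = [m for name in first_pass for m in resolve(d, name)]

        # Intended behaviour: treat resolved names as verbatim paths.
        verbatim = [n for n in first_pass if os.path.isfile(os.path.join(d, n))]
        return first_pass, second_pass, verbatim


if __name__ == "__main__":
    first, second, verbatim = demo()
    print("first pass :", first)
    print("second pass:", second)
    print("verbatim   :", verbatim)
```

In this sketch the first pass selects the file the user asked for, the second pass silently selects different files, and the verbatim lookup keeps the original selection. That corresponds to the behaviour the issue title asks for: paths should not be globbed again during schema inference.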