Description
When the user doesn't specify a schema while reading a CSV table, the CSV file format and data source need to infer the schema, and they do so by creating a base DataSource relation. This is where a mismatch arises: FileFormat.inferSchema expects actual file paths without glob patterns, but DataSource.paths expects file paths as glob patterns.
An example is demonstrated below:
^
| DataSource.resolveRelation   tries to glob again (incorrectly) on glob pattern """[abc].csv"""
| DataSource.apply             ^
| CSVDataSource.inferSchema    |
| CSVFileFormat.inferSchema    |
| ...                          |
| DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
| DataSource.apply             ^
| DataFrameReader.load         |
|                              input """\[abc\].csv"""
The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
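The double-globbing effect can be reproduced outside Spark. The sketch below is only an analogy built on Python's standard glob module, not Spark or Hadoop code: the helper names (resolve_once, demo) are hypothetical, and Python escapes "[" as "[[]" rather than with the backslash used in the input above. It shows that the first glob pass correctly resolves the escaped input to the literal file name, while a second glob pass over that result reinterprets the brackets as a character class and picks up the wrong files.

```python
import glob
import os
import tempfile


def resolve_once(pattern, root):
    """Expand a glob pattern inside root and return the sorted matching file names."""
    # Escape the directory part so only `pattern` is interpreted as a glob.
    full_pattern = os.path.join(glob.escape(root), pattern)
    return sorted(os.path.basename(p) for p in glob.glob(full_pattern))


def demo():
    with tempfile.TemporaryDirectory() as root:
        # One file whose name contains glob metacharacters, plus two decoys.
        for name in ("[abc].csv", "a.csv", "b.csv"):
            open(os.path.join(root, name), "w").close()

        # The user escapes the metacharacters (Python's equivalent of \[abc\].csv).
        user_input = glob.escape("[abc].csv")

        # First pass: the escaped pattern resolves to the literal file name.
        first = resolve_once(user_input, root)

        # Second pass (the bug): the resolved name is globbed again,
        # so "[abc]" is now read as a character class.
        second = [m for name in first for m in resolve_once(name, root)]

        # Treating the resolved name as a verbatim path avoids the problem.
        verbatim = [n for n in first if os.path.exists(os.path.join(root, n))]

        return user_input, first, second, verbatim


if __name__ == "__main__":
    user_input, first, second, verbatim = demo()
    print("input:      ", user_input)  # [[]abc].csv
    print("first pass: ", first)       # ['[abc].csv']
    print("second pass:", second)      # ['a.csv', 'b.csv']
    print("verbatim:   ", verbatim)    # ['[abc].csv']
```

The second pass never returns the intended file, which mirrors the mismatch described above: a path that has already been globbed needs to be handled verbatim by whatever consumes it next.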
Issue Links
- is related to SPARK-32815 Fix LibSVM data source loading error on file paths with glob metacharacters (Resolved)