  Spark / SPARK-32810

CSV/JSON data sources should avoid globbing paths when inferring schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.6, 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 2.4.7, 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      When the user does not specify a schema while reading a CSV table, the CSV file format and data source must infer it. They do so by creating a base DataSource relation, and that is where a mismatch arises: FileFormat.inferSchema expects actual file paths without glob patterns, but DataSource.paths expects file paths in glob patterns. As a result, paths that have already been resolved are globbed a second time.
      The call stack below illustrates the problem (read from bottom to top):

      ^
      |         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
      |         DataSource.apply                      ^
      |       CSVDataSource.inferSchema               |
      |     CSVFileFormat.inferSchema                 |
      |   ...                                         |
      |   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
      |   DataSource.apply                            ^
      | DataFrameReader.load                          |
      |                                       input """\[abc\].csv"""
      

      The same problem exists in the JSON data source and in MLlib's LibSVM data source.
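
The double-globbing effect in the call stack above can be reproduced outside Spark. The following is a minimal sketch in plain Python, not Spark code: it uses the standard `glob` module as a stand-in for Hadoop path globbing, and `glob.escape` as a stand-in for the backslash escaping shown in the diagram. The file names and the `resolve` helper are hypothetical and exist only for illustration.

```python
import glob
import os
import tempfile


def resolve(pattern):
    """Expand a glob pattern into concrete paths (stand-in for the first resolution step)."""
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # One file whose literal name contains glob metacharacters, plus two decoys.
    for name in ("[abc].csv", "a.csv", "b.csv"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x\n1\n")

    # User input: the brackets are escaped, so only the literal file is meant.
    user_input = os.path.join(glob.escape(tmp), glob.escape("[abc].csv"))

    # First pass: the escaped pattern resolves to the literal file.
    first_pass = resolve(user_input)

    # Buggy second pass: the resolved path is treated as a glob pattern again,
    # so "[abc]" now means "one of a, b, c" and matches the decoys instead.
    buggy = sorted(p for path in first_pass for p in glob.glob(path))

    # Intended second pass: resolved paths are taken verbatim, with no globbing.
    fixed = [p for p in first_pass if os.path.exists(p)]

    first_names = [os.path.basename(p) for p in first_pass]
    buggy_names = [os.path.basename(p) for p in buggy]
    fixed_names = [os.path.basename(p) for p in fixed]

print(first_names)  # ['[abc].csv']
print(buggy_names)  # ['a.csv', 'b.csv']
print(fixed_names)  # ['[abc].csv']
```

In this sketch the second glob silently selects the wrong files, which matches the mismatch described above: whichever step infers the schema should receive the already-resolved paths and use them as-is.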

          People

            Assignee: maxgekk Max Gekk
            Reporter: maxgekk Max Gekk