Spark / SPARK-32810

CSV/JSON data sources should avoid globbing paths when inferring schema

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.6, 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 2.4.7, 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When the user does not specify a schema while reading a CSV table, the CSV file format and data source need to infer it. They do so by creating a base DataSource relation, and that is where the mismatch arises: FileFormat.inferSchema expects actual file paths without glob patterns, but DataSource.paths expects file paths as glob patterns. As a result, a path that has already been resolved from a glob pattern is globbed a second time.
      The call stack below (read from bottom to top) demonstrates the problem:

      ^
      |         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
      |         DataSource.apply                      ^
      |       CSVDataSource.inferSchema               |
      |     CSVFileFormat.inferSchema                 |
      |   ...                                         |
      |   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
      |   DataSource.apply                            ^
      | DataFrameReader.load                          |
      |                                       input """\[abc\].csv"""
      

      The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
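
The double-globbing effect in the call stack above can be reproduced outside Spark. The following is a minimal sketch using only Python's standard `glob` module as an analogy; it does not call Spark or Hadoop. The function names `glob_once` and `demo_double_glob` are hypothetical, and Python escapes a literal `[` as `[[]` rather than with the backslash form (`\[abc\].csv`) shown in the issue.

```python
import glob
import os
import tempfile


def glob_once(directory, pattern):
    """Expand `pattern` inside `directory` and return the sorted base names."""
    # Escape the directory part so only `pattern` is interpreted as a glob.
    full_pattern = os.path.join(glob.escape(directory), pattern)
    return sorted(os.path.basename(m) for m in glob.glob(full_pattern))


def demo_double_glob():
    """Return (first_pass, second_pass) for a file literally named '[abc].csv'."""
    with tempfile.TemporaryDirectory() as tmp:
        # One file whose name contains glob metacharacters, one plain file.
        for name in ["[abc].csv", "a.csv"]:
            with open(os.path.join(tmp, name), "w") as f:
                f.write("x\n1\n")

        # The user escapes the name so it is matched verbatim.
        escaped = glob.escape("[abc].csv")

        # First glob: the escaped pattern resolves to the literal file name.
        first_pass = glob_once(tmp, escaped)

        # Second glob (the bug): the resolved name is treated as a pattern
        # again, so '[abc]' now acts as a character class.
        second_pass = [m for name in first_pass for m in glob_once(tmp, name)]
        return first_pass, second_pass


if __name__ == "__main__":
    first, second = demo_double_glob()
    print("first pass: ", first)   # ['[abc].csv']
    print("second pass:", second)  # ['a.csv']
```

In this sketch the second expansion silently picks up `a.csv` instead of the intended file. If no file matched the character class, the second expansion would return nothing, which roughly corresponds to a path-not-found failure during schema inference.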

            People

            • Assignee: Max Gekk (maxgekk)
            • Reporter: Max Gekk (maxgekk)
