Spark / SPARK-32810

CSV/JSON data sources should avoid globbing paths when inferring schema


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.6, 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 2.4.7, 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      When the user does not specify a schema while reading a CSV table, the CSV file format and data source must infer it. They do so by creating a base DataSource relation, which exposes a mismatch: FileFormat.inferSchema expects actual file paths without glob patterns, but DataSource.paths expects file paths that are glob patterns. Paths that have already been resolved are therefore globbed a second time.
      The call stack below illustrates this, reading upward from the user's input:

      ^
      |         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
      |         DataSource.apply                      ^
      |       CSVDataSource.inferSchema               |
      |     CSVFileFormat.inferSchema                 |
      |   ...                                         |
      |   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
      |   DataSource.apply                            ^
      | DataFrameReader.load                          |
      |                                       input """\[abc\].csv"""
      

      The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
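
The double-globbing effect described above can be reproduced outside Spark. The following is a minimal sketch using only Python's standard `glob` module, not Spark or Hadoop code. The helper `resolve` and its `glob_paths` flag are hypothetical names introduced for illustration, and Python escapes `[` as `[[]` rather than with the backslash shown in the call stack. It assumes a directory containing a file literally named `[abc].csv` alongside a decoy `a.csv`.

```python
import glob
import os
import tempfile


def resolve(directory, path, glob_paths=True):
    """Return the file names that `path` refers to inside `directory`.

    glob_paths=True expands `path` as a glob pattern; glob_paths=False
    treats it as a verbatim file name. Hypothetical helper, not a Spark API.
    """
    if glob_paths:
        # Escape the directory so only `path` is interpreted as a pattern.
        pattern = os.path.join(glob.escape(directory), path)
        return sorted(os.path.basename(p) for p in glob.glob(pattern))
    return [path] if os.path.isfile(os.path.join(directory, path)) else []


def demo():
    with tempfile.TemporaryDirectory() as d:
        # One file whose literal name contains glob metacharacters, plus a decoy.
        for name in ("[abc].csv", "a.csv"):
            with open(os.path.join(d, name), "w") as f:
                f.write("x\n1\n")

        # The user escapes the brackets so the name is matched literally.
        user_input = glob.escape("[abc].csv")

        # First resolution: correct, yields the literal file name.
        first = resolve(d, user_input)

        # Second resolution of the already-resolved name: "[abc]" is now
        # read as a character class, so the decoy is matched instead.
        globbed_again = resolve(d, first[0])

        # Treating the resolved name as a verbatim path avoids the problem.
        verbatim = resolve(d, first[0], glob_paths=False)
        return first, globbed_again, verbatim


if __name__ == "__main__":
    print(demo())
```

In this sketch the second glob pass silently selects `a.csv` instead of `[abc].csv`, while the verbatim lookup returns the intended file. This mirrors why schema inference should receive already-resolved paths without globbing them again.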

          People

            Assignee: Max Gekk (maxgekk)
            Reporter: Max Gekk (maxgekk)