[SPARK-18726] Filesystem unnecessarily scanned twice during creation of non-catalog table - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0
Component/s: SQL
Labels: None

Description

It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan the filesystem twice: once for schema inference, and once more to create the FileIndex for the relation.

It would be better to combine these scans somehow, since this is the most costly step of creating a table. This is a follow-up ticket to https://github.com/apache/spark/pull/16090.
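
To make the cost concrete, here is a minimal sketch of the pattern described above. It is not Spark's code: `CountingLister`, `infer_schema`, `build_index`, and both `create_relation_*` functions are hypothetical stand-ins, and the "schema" is just the set of file extensions. It only illustrates that listing once and reusing the result gives the same relation with half the filesystem scans.

```python
import os
import tempfile


class CountingLister:
    """Hypothetical stand-in for a filesystem listing; counts how often it runs."""

    def __init__(self, root):
        self.root = root
        self.scans = 0

    def list_files(self):
        self.scans += 1
        found = []
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                found.append(os.path.join(dirpath, name))
        return sorted(found)


def infer_schema(files):
    # Placeholder: real schema inference reads file metadata; here we only
    # collect the distinct file extensions.
    return sorted({os.path.splitext(f)[1] for f in files})


def build_index(files):
    # Placeholder for a file index: map each directory to its files.
    index = {}
    for f in files:
        index.setdefault(os.path.dirname(f), []).append(f)
    return index


def create_relation_two_scans(lister):
    schema = infer_schema(lister.list_files())  # first scan
    index = build_index(lister.list_files())    # second scan
    return schema, index


def create_relation_one_scan(lister):
    files = lister.list_files()                 # single scan, reused below
    return infer_schema(files), build_index(files)


def demo():
    # Build a tiny directory of empty files and compare both approaches.
    with tempfile.TemporaryDirectory() as root:
        for name in ("part-0.parquet", "part-1.parquet"):
            open(os.path.join(root, name), "w").close()
        a, b = CountingLister(root), CountingLister(root)
        r1 = create_relation_two_scans(a)
        r2 = create_relation_one_scan(b)
        return a.scans, b.scans, r1 == r2
```

Calling `demo()` returns the scan count of each approach and whether the two results match. On a real table with many partitions and files, the listing step dominates, which is why sharing it matters.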

cc cloud_fan

Issue Links

links to

[Github] Pull Request #17081 (windpiger)

People

Assignee: Song Jun

Reporter: Eric Liang

Votes: 0

Watchers: 5

Dates

Created: 05/Dec/16 22:11

Updated: 03/Mar/17 07:54

Resolved: 03/Mar/17 07:54