Description
A majority of Spark SQL queries likely run through HadoopFSRelation, but this code path currently has several complexity and performance problems:
- The class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.
- For very large tables, we are broadcasting the entire list of files to every executor.
- SPARK-11441: For partitioned tables, we always do an extra projection. This not only results in a copy of every row, but also undoes much of the performance gain we expect from vectorized reads.
This is an umbrella ticket to track a set of improvements to this codepath.
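The extra-projection problem above can be illustrated with a small sketch. This is plain Python, not Spark code; the function names and row representation are invented for illustration. The point is that because partition column values are constant per file and not stored in the data files, appending them requires rebuilding (copying) every row, which defeats the benefit of passing columnar batches through unchanged.

```python
# Illustrative sketch only (not Spark's implementation): the cost of
# appending partition columns via a per-row projection.

def read_file_rows():
    # Rows as stored in a data file: partition columns are not
    # materialized, since they are constant for the whole file.
    return [("a", 1), ("b", 2)]

def append_partition_values(rows, part_values):
    # The extra projection: each row is rebuilt as a wider tuple so the
    # constant partition values can be appended. Every row is copied,
    # which undoes much of the win from vectorized (batch) reads.
    return [row + part_values for row in rows]

rows = append_partition_values(read_file_rows(), ("2016-01-01",))
print(rows)  # [('a', 1, '2016-01-01'), ('b', 2, '2016-01-01')]
```

A scan that kept partition values out of the per-row data and attached them once per file (or per batch) would avoid this copy entirely.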
Issue Links
- is duplicated by: SPARK-8813 "Combine files when there're many small files in table" (Resolved)