Spark / SPARK-13664

Simplify and Speedup HadoopFSRelation


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      A majority of Spark SQL queries likely run through HadoopFSRelation; however, this code path currently has several complexity and performance problems:

      • The class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.
      • For very large tables, we are broadcasting the entire list of files to every executor (see SPARK-11441).
      • For partitioned tables, we always do an extra projection. This not only results in a copy, but also undoes much of the performance gain we expect from vectorized reads.

      This is an umbrella ticket to track a set of improvements to this code path; a minimal example of a query that exercises it is sketched below.
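
      As a point of reference, the sketch below (the path, column names, and object name are hypothetical) shows the kind of query that runs through HadoopFSRelation: a read of a partitioned Parquet table, which during planning must list the table's files and prune partitions.

      {code:scala}
      // Minimal sketch (hypothetical path and column names) of a query that
      // exercises the HadoopFSRelation scan path: a partitioned Parquet table
      // read with a filter on the partition column.
      import org.apache.spark.sql.SparkSession

      object HadoopFsRelationScanExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("HadoopFSRelation scan example")
            .getOrCreate()

          // Assume a layout like /data/events/date=2016-03-01/part-*.parquet;
          // `date` is a partition column discovered from the directory structure.
          val events = spark.read.parquet("/data/events")

          // Planning this query lists the table's files and prunes partitions
          // by `date`; both steps run through the code path this ticket targets.
          events.filter("date = '2016-03-01'")
            .groupBy("userId")
            .count()
            .show()

          spark.stop()
        }
      }
      {code}

      The improvements tracked here target the file listing, partition handling, and scan building behind such queries, not the user-facing API.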


          People

            Assignee:
            marmbrus Michael Armbrust
            Reporter:
            marmbrus Michael Armbrust
            Votes:
            1
            Watchers:
            9

            Dates

              Created:
              Updated:
              Resolved:
