Spark / SPARK-9926

Parallelize file listing for partitioned Hive table


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1, 1.5.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

      Description

      In Spark SQL, short queries like select * from table limit 10 run very slowly against partitioned Hive tables because of file listing. In particular, when a large number of partitions are scanned on storage like S3, the queries run extremely slowly. Here are some example benchmarks from my environment:

      • Parquet-backed Hive table
      • Partitioned by dateint and hour
      • Stored on S3
      | # of partitions | # of files | runtime | query                                                                          |
      | 1               | 972        | 30 secs | select * from nccp_log where dateint=20150601 and hour=0 limit 10;             |
      | 24              | 13646      | 6 mins  | select * from nccp_log where dateint=20150601 limit 10;                        |
      | 240             | 136222     | 1 hour  | select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10; |

      The problem is that TableReader constructs a separate HadoopRDD per Hive partition path and groups them into a UnionRDD. Then, all the input files are listed sequentially. In other tools such as Hive and Pig, this can be solved by setting mapreduce.input.fileinputformat.list-status.num-threads high. But in Spark, since each HadoopRDD lists only one partition path, setting this property doesn't help.
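      For illustration, here is a minimal sketch of what parallel listing could look like. This is not the code from the actual fix; the helper name, the thread-pool size, and the partitionPaths input are assumptions. The idea is the same one that mapreduce.input.fileinputformat.list-status.num-threads enables in Hive and Pig, but applied across all partition paths at once instead of within a single HadoopRDD.

{code:scala}
// Illustrative sketch only, not the actual Spark patch: list every Hive
// partition path on a shared thread pool instead of letting each HadoopRDD
// list its single path sequentially. The helper name, pool size, and
// partitionPaths input are assumptions for this example.
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object ParallelListing {
  def listPartitionsInParallel(
      partitionPaths: Seq[Path],
      conf: Configuration,
      numThreads: Int = 20): Seq[FileStatus] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Kick off one listStatus call per partition path; they run concurrently.
      val futures = partitionPaths.map { path =>
        Future {
          val fs = path.getFileSystem(conf)
          fs.listStatus(path).toSeq
        }
      }
      // Wait for all listings and flatten them into a single file list.
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }
}
{code}

      On high-latency storage such as S3, this turns N sequential listStatus round trips into roughly N / numThreads, which is where queries over many partitions spend most of their time today.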


              People

              • Assignee: Ryan Blue (rdblue)
              • Reporter: Cheolsoo Park (cheolsoo)
              • Votes: 0
              • Watchers: 10
