Description
In Spark SQL, short queries like `select * from table limit 10` run very slowly against partitioned Hive tables because of file listing. In particular, when a large number of partitions must be scanned on storage like S3, the queries run extremely slowly. Here are some example benchmarks from my environment:
- Parquet-backed Hive table
- Partitioned by dateint and hour
- Stored on S3
| # of partitions | # of files | runtime | query |
|---|---|---|---|
| 1 | 972 | 30 secs | select * from nccp_log where dateint=20150601 and hour=0 limit 10; |
| 24 | 13646 | 6 mins | select * from nccp_log where dateint=20150601 limit 10; |
| 240 | 136222 | 1 hour | select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10; |
The problem is that TableReader constructs a separate HadoopRDD per Hive partition path and groups them into a UnionRDD. As a result, all the input files are listed sequentially, one partition path at a time. In other tools such as Hive and Pig, this can be worked around by setting mapreduce.input.fileinputformat.list-status.num-threads to a high value, so that a single FileInputFormat lists its input paths in parallel. But in Spark, since each HadoopRDD lists only one partition path, setting this property doesn't help.
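A minimal sketch of the parallel-listing idea described above: instead of listing each partition path sequentially, fan the listing calls out across a thread pool, one task per partition directory. This is not Spark's TableReader code; it is a hypothetical stand-alone illustration using local directories (and java.io.File rather than Hadoop's FileSystem API) to stand in for S3 partition paths.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListing {
    // Lists the files under each partition directory concurrently, the same
    // idea that mapreduce.input.fileinputformat.list-status.num-threads
    // applies within a single FileInputFormat.
    public static List<File> listAllFiles(List<File> partitionDirs, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit one listing task per partition path.
            List<Future<File[]>> futures = new ArrayList<>();
            for (File dir : partitionDirs) {
                futures.add(pool.submit((Callable<File[]>) dir::listFiles));
            }
            // Gather the results as the listings complete.
            List<File> all = new ArrayList<>();
            for (Future<File[]> f : futures) {
                File[] listed = f.get();
                if (listed != null) {
                    for (File file : listed) all.add(file);
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local stand-in for S3 partition directories:
        // 4 hour partitions with 3 files each.
        File base = Files.createTempDirectory("nccp_log").toFile();
        List<File> partitions = new ArrayList<>();
        for (int hour = 0; hour < 4; hour++) {
            File d = new File(base, "dateint=20150601/hour=" + hour);
            d.mkdirs();
            for (int i = 0; i < 3; i++) {
                new File(d, "part-" + i + ".parquet").createNewFile();
            }
            partitions.add(d);
        }
        System.out.println(listAllFiles(partitions, 4).size()); // 12
    }
}
```

With 240 partitions on S3, where each listing call is a slow remote round trip, this kind of fan-out is what turns the sequential one-hour listing into something bounded by the slowest partition rather than the sum of all of them.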
Issue Links
- is depended upon by: SPARK-10340 Use S3 bulk listing for S3-backed Hive tables (Resolved)
- is duplicated by: SPARK-10340 Use S3 bulk listing for S3-backed Hive tables (Resolved)
- is related to: HADOOP-12810 FileSystem#listLocatedStatus causes unnecessary RPC calls (Closed)