Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
For tables with Hive-style partitions such as
/table/2018/Q1
/table/2018/Q2
/table/2019/Q1
if either of the following queries is run:
select distinct dir0 from dfs.`/table`
select dir0 from dfs.`/table` group by dir0
Drill scans every record in the table instead of just listing the directories at the dir0 level. This happens even when cached metadata is available, and the penalty grows with the size of the dataset.
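To illustrate why a directory listing is enough, here is a minimal Python sketch that derives the distinct `dir0` values from file paths alone, without opening any file. It is not Drill code: the function name `partition_values` and the file names are hypothetical, and files that sit above the requested directory level are simply skipped here.

```python
from pathlib import PurePosixPath

def partition_values(table_root, file_paths, level=0):
    """Return the sorted distinct values of the dir<level> column, using paths only."""
    root = PurePosixPath(table_root)
    values = set()
    for path in file_paths:
        # Directory components between the table root and the file name.
        dirs = PurePosixPath(path).relative_to(root).parts[:-1]
        if level < len(dirs):
            values.add(dirs[level])
    return sorted(values)

# Hypothetical file listing matching the layout in the description.
files = [
    "/table/2018/Q1/a.parquet",
    "/table/2018/Q2/b.parquet",
    "/table/2019/Q1/c.parquet",
]
print(partition_values("/table", files))  # ['2018', '2019']
```

The answer to `select distinct dir0` depends only on the directory structure, which is the information a file listing or the metadata cache already holds.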
To avoid this, a logical pruning rule can collect the partition column values (`dir0`), either from the metadata cache (if available) or from the group scan, and drop unnecessary files from the scan. The rule is applied when both of the following conditions hold:
1) all queried columns are partition columns, and
2) a DISTINCT or GROUP BY operation is performed.
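The two conditions and the pruning step can be sketched in Python as follows. This is an illustrative model only, not the planner rule itself: `rule_applies` and `prune_files` are hypothetical names, partition columns are assumed to be exactly those named `dir<N>`, and "dropping unnecessary files" is interpreted here as keeping one file per distinct combination of the queried partition values.

```python
import re
from pathlib import PurePosixPath

# Assumption: partition columns are the implicit directory columns dir0, dir1, ...
PARTITION_COLUMN = re.compile(r"dir\d+")

def rule_applies(columns, has_distinct, has_group_by):
    """Condition 1: only partition columns are queried. Condition 2: DISTINCT or GROUP BY."""
    only_partitions = bool(columns) and all(
        PARTITION_COLUMN.fullmatch(c) for c in columns
    )
    return only_partitions and (has_distinct or has_group_by)

def prune_files(table_root, file_paths, columns):
    """Keep the first file seen for each distinct combination of partition values."""
    root = PurePosixPath(table_root)
    levels = [int(c[3:]) for c in columns]  # "dir0" -> 0
    kept = {}
    for path in file_paths:
        dirs = PurePosixPath(path).relative_to(root).parts[:-1]
        key = tuple(dirs[i] if i < len(dirs) else None for i in levels)
        kept.setdefault(key, path)
    return list(kept.values())

# Hypothetical file listing matching the layout in the description.
files = [
    "/table/2018/Q1/a.parquet",
    "/table/2018/Q2/b.parquet",
    "/table/2019/Q1/c.parquet",
]
if rule_applies(["dir0"], has_distinct=True, has_group_by=False):
    print(prune_files("/table", files, ["dir0"]))
```

With the listing above, only one file per year remains to be scanned, so the work is proportional to the number of partitions, not the number of records.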