Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The following query should read a single column from the target parquet file.
open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% collect()
Furthermore, it should push the filter down to the source node, so that parquet row-group statistics can be used to skip row groups that cannot contain any matching rows.
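To illustrate why a filter at the source helps, here is a minimal sketch of statistics-based row-group skipping. It is not Arrow's implementation: `RowGroupStats` and `row_groups_to_read` are hypothetical names, and the min/max values are made up for the example. It only relies on the fact that parquet can store min/max statistics per column chunk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one column chunk."""
    index: int
    min_value: float
    max_value: float


def row_groups_to_read(stats: List[RowGroupStats], upper_bound: float) -> List[int]:
    """Return the row groups that may contain rows with value < upper_bound."""
    # A row group can be skipped when even its smallest value fails the predicate.
    return [s.index for s in stats if s.min_value < upper_bound]


# Made-up statistics for the l_tax column, for illustration only.
stats = [
    RowGroupStats(0, 0.00, 0.04),
    RowGroupStats(1, 0.02, 0.08),
    RowGroupStats(2, 0.05, 0.08),
]
selected = row_groups_to_read(stats, 0.01)
print(selected)
```

With these made-up statistics only the first row group could hold a row with l_tax < 0.01, so the other two would never be read from disk. The rows in the selected group still need to be filtered individually, since the statistics only say that a match is possible.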
At the moment it creates the following exec plan:
3:SinkNode{}
2:ProjectNode{projection=[l_tax]}
1:FilterNode{filter=(l_tax < 0.01)}
0:SourceNode{}
There is no projection or filter in the source node. As a result we end up reading much more data from disk (the entire file) than we need to (at most a single column).
This could be fixed via heuristics in the dplyr code. However, that may quickly get complex. For example, the project comes after the filter, so the pushed-down projection must include both the columns accessed by the filter and the columns accessed by the projection. Other questions follow: can you push a projection down through a join (yes, you can), and how do you know which columns to apply to which source node?
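The simplest form of such a heuristic can be sketched as follows. This is a toy model rather than the dplyr or Arrow code: the `Source`, `Filter` and `Project` classes and the `push_down` function are hypothetical, the plan is assumed to be linear (no joins), and projections are assumed to select columns without renaming or computing anything.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Source:
    columns: Optional[List[str]] = None  # None: read every column
    filter: Optional[str] = None         # None: nothing pushed down


@dataclass
class Filter:
    expr: str
    fields: List[str]  # columns referenced by expr


@dataclass
class Project:
    columns: List[str]


Node = Union[Source, Filter, Project]


def push_down(plan: List[Node]) -> List[Node]:
    """Rewrite a linear plan (source first) so the source reads only what is needed."""
    source, rest = plan[0], plan[1:]
    if not isinstance(source, Source):
        raise ValueError("plan must start with a Source node")
    needed: List[str] = []
    filters: List[str] = []
    projected = False
    for node in rest:
        if isinstance(node, Filter):
            filters.append(node.expr)
            cols = node.fields
        else:
            cols = node.columns
            projected = True
        # The source must supply the filter's columns as well as the projection's.
        for col in cols:
            if col not in needed:
                needed.append(col)
        if projected:
            break  # later nodes only see the projected columns
    pushed = Source(
        columns=needed if projected else None,
        filter=" and ".join(f"({f})" for f in filters) or None,
    )
    # The downstream filter is kept: the pushed-down copy may only skip whole row groups.
    return [pushed] + rest


plan = [Source(), Filter("l_tax < 0.01", ["l_tax"]), Project(["l_tax"])]
optimized = push_down(plan)
print(optimized[0])
```

Even this restricted case has to merge the column sets of two nodes. Supporting joins, computed columns, or several source nodes would need real dependency tracking, which is the complexity described above.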
A more complete fix would be to call into some kind of third-party optimizer (e.g. Apache Calcite) after the plan has been created by dplyr but before it is passed to the execution engine.