Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
In ExecPlan$Build(), we call Project in a few places, and there is code that ensures the query contains at least one ProjectNode so that augmented fields are removed from a Dataset scan (unless the user has added them). As a result, it is possible to get multiple consecutive ProjectNodes that are essentially no-ops. One example is grouped aggregation: one projection restores the column order that R expects, and it is followed by a second projection that changes nothing:
> mtcars |> arrow_table() |> count(cyl) |> explain()
ExecPlan with 6 nodes:
5:SinkNode{}
  4:ProjectNode{projection=[cyl, n]}
    3:ProjectNode{projection=[cyl, n]}
      2:GroupByNode{keys=["cyl"], aggregates=[
        hash_sum(n, {skip_nulls=true, min_count=1}),
      ]}
        1:ProjectNode{projection=["n": 1, cyl]}
          0:TableSourceNode{}
I don't know how significant the performance impact is, but it certainly looks wasteful and should be avoidable.
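To make the idea concrete, here is a minimal sketch of how a redundant projection could be detected and skipped. It is a toy model in Python, not the arrow R package or the C++ ExecPlan API: the Node class, is_noop_projection() and drop_noop_projections() are hypothetical names, the source column list is shortened, and the assumption that the group-by node emits aggregates before keys is only inferred from the need for a reordering projection. A projection is treated as a no-op when it selects exactly the input columns, in the same order, with no computed expressions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an ExecPlan node (not the Arrow API)."""
    kind: str                    # e.g. "TableSourceNode", "ProjectNode"
    columns: Tuple[str, ...]     # output column names, in order
    exprs: Tuple[str, ...] = ()  # ProjectNode only: one expression per output column


def is_noop_projection(node: Node, input_columns: Tuple[str, ...]) -> bool:
    """True if the node is a projection that passes its input through unchanged."""
    return (
        node.kind == "ProjectNode"
        and node.columns == input_columns  # same names, same order
        and node.exprs == input_columns    # plain field references only
    )


def drop_noop_projections(plan: List[Node]) -> List[Node]:
    """Walk the plan from source to sink and skip projections that change nothing."""
    kept: List[Node] = []
    for node in plan:
        if kept and is_noop_projection(node, kept[-1].columns):
            continue
        kept.append(node)
    return kept


# Toy version of the count(cyl) plan, listed from source (0) to sink (5).
plan = [
    Node("TableSourceNode", ("mpg", "cyl")),        # shortened column list
    Node("ProjectNode", ("n", "cyl"), ("1", "cyl")),
    Node("GroupByNode", ("n", "cyl")),              # assumed: aggregates, then keys
    Node("ProjectNode", ("cyl", "n"), ("cyl", "n")),  # reorders columns: kept
    Node("ProjectNode", ("cyl", "n"), ("cyl", "n")),  # identity: dropped
    Node("SinkNode", ("cyl", "n")),
]

optimized = drop_noop_projections(plan)
for i, node in enumerate(optimized):
    print(f"{i}:{node.kind} {list(node.columns)}")
```

In this model the reordering projection survives because its output order differs from the group-by output, while the second, identical projection is removed. A real fix would more likely avoid adding the extra node at build time than prune it afterwards.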