Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
After looking at the ExecPlan output of some queries, it jumped out at me how we translate {{ int_field == 5 }} in R as {{ cast(int_field, float64) == 5 }} because 5 is a double in R.
This extra work has a noticeable performance impact. Here's a simple query on the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting a single column. My idea was to make a query that does not much other than evaluate the filter.
> system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> compute()) user system elapsed 0.391 0.024 0.362 > system.time(ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> compute()) user system elapsed 0.206 0.025 0.179
You can see the difference in the query plans too:
> ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain() ExecPlan with 4 nodes: 3:SinkNode{} 2:ProjectNode{projection=[passenger_count]} 1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)} 0:SourceNode{} > ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> explain() ExecPlan with 4 nodes: 3:SinkNode{} 2:ProjectNode{projection=[passenger_count]} 1:FilterNode{filter=(passenger_count > 10)} 0:SourceNode{}
Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R.
Attachments
Issue Links
- fixes
-
ARROW-18244 [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types
- Resolved
-
ARROW-13938 [C++] Date and datetime types should autocast from strings
- Resolved
- links to