Apache Arrow / ARROW-17462

[R] Cast scalars to type of field in Expression building

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 11.0.0
    • Component/s: R

    Description

      After looking at the ExecPlan output of some queries, it jumped out at me how we translate {{ int_field == 5 }} in R as {{ cast(int_field, float64) == 5 }} because 5 is a double in R.
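
      For readers less familiar with R's numeric literals, here is a small base-R illustration of why the comparison value ends up as a double. It does not use arrow at all; the variable names are only for illustration.

      ```r
      # A bare numeric literal in R is stored as a double.
      literal_type <- typeof(5)

      # The L suffix is what makes a literal an integer.
      integer_type <- typeof(5L)

      # The two values compare equal, but they are not the same type.
      values_equal <- 5 == 5L
      same_object <- identical(5, 5L)

      print(c(literal_type, integer_type))
      print(c(values_equal, same_object))
      ```

      So a filter written as int_field == 5 hands a double scalar to the expression builder unless the user writes 5L or constructs a typed Scalar explicitly.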

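      This extra work has a noticeable performance impact: in the timings below, elapsed time drops from 0.362 to 0.179 seconds when the scalar is created as int8 instead of a double. Here's a simple query on the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting a single column. The idea was to write a query that does little other than evaluate the filter.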

      > system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> compute())
         user  system elapsed 
        0.391   0.024   0.362 
      
      > system.time(ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> compute())
         user  system elapsed 
        0.206   0.025   0.179 
      

      You can see the difference in the query plans too:

      > ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain()
      ExecPlan with 4 nodes:
      3:SinkNode{}
        2:ProjectNode{projection=[passenger_count]}
          1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)}
            0:SourceNode{}
      
      > ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> explain()
      ExecPlan with 4 nodes:
      3:SinkNode{}
        2:ProjectNode{projection=[passenger_count]}
          1:FilterNode{filter=(passenger_count > 10)}
            0:SourceNode{}
      

      Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R.
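
      As a rough sketch of what "smarter" could mean on the R side, the check below decides whether a numeric literal can be represented exactly in a signed integer type of a given width. The helper fits_signed_int is hypothetical and not part of the arrow package; it only illustrates the kind of guard that could run before casting the scalar to the field's type rather than casting the field to double. The actual fix may work differently.

      ```r
      # Hypothetical helper (an assumption for illustration, not arrow API):
      # TRUE when `value` is a single whole number that fits in a signed
      # integer of `bits` width, so casting the scalar would not change it.
      fits_signed_int <- function(value, bits) {
        stopifnot(is.numeric(value), length(value) == 1, bits %in% c(8, 16, 32))
        # Missing and non-finite values cannot be cast safely.
        if (is.na(value) || !is.finite(value)) {
          return(FALSE)
        }
        limit <- 2^(bits - 1)
        value == trunc(value) && value >= -limit && value <= limit - 1
      }

      fits_signed_int(10, 8)    # whole number inside the int8 range
      fits_signed_int(10.5, 8)  # fractional part would be lost
      fits_signed_int(200, 8)   # outside the int8 range of -128 to 127
      ```

      When the check fails, the safe fallback is the existing behavior of leaving the literal as a double, since a comparison such as int_field > 10.5 cannot be rewritten as an integer comparison without changing the value being compared.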

            People

              Assignee: Neal Richardson
              Reporter: Neal Richardson

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h