Apache Arrow / ARROW-17463

[R] Avoid unnecessary projections

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0.0
    • Component/s: R

    Description

      In ExecPlan$Build(), we call Project in a few places, and there is code that ensures the query contains at least one ProjectNode so that augmented fields from a Dataset scan are removed (unless the user has added them). As a result, it is possible to get multiple consecutive ProjectNodes that are essentially no-ops. One example is grouped aggregation: there is a projection to restore the column order that R expects, followed by a no-op projection:

      > mtcars |> arrow_table() |> count(cyl) |> explain()
      ExecPlan with 6 nodes:
      5:SinkNode{}
        4:ProjectNode{projection=[cyl, n]}
          3:ProjectNode{projection=[cyl, n]}
            2:GroupByNode{keys=["cyl"], aggregates=[
            	hash_sum(n, {skip_nulls=true, min_count=1}),
            ]}
              1:ProjectNode{projection=["n": 1, cyl]}
                0:TableSourceNode{}
      

      I don't know how significant a performance impact this has, but it certainly looks wasteful and should be avoidable.
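
      To make the idea concrete, here is a minimal toy sketch in Python of how a redundant projection could be detected and skipped. It is not the arrow R package's code: ProjectNode, is_noop and prune are hypothetical names, each projection is modelled as a mapping from output column name to expression text, and the sketch assumes the GroupByNode emits n before cyl (which is why the first projection has to reorder).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProjectNode:
    # Hypothetical model: output column name -> expression text.
    # A bare field reference is an expression equal to the column name.
    projection: Dict[str, str]


def is_noop(node: ProjectNode, input_names: List[str]) -> bool:
    """True if the node selects exactly the input columns, unchanged and in order."""
    return list(node.projection) == input_names and all(
        expr == name for name, expr in node.projection.items()
    )


def prune(source_names: List[str], projections: List[ProjectNode]) -> List[ProjectNode]:
    """Walk the projections source-first and drop the ones that change nothing."""
    kept = []
    names = list(source_names)
    for node in projections:
        if is_noop(node, names):
            continue  # same columns, same order, no computed expressions
        kept.append(node)
        names = list(node.projection)  # schema seen by the next node
    return kept


# Assumed GroupByNode output order: aggregate first, then key.
reorder = ProjectNode({"cyl": "cyl", "n": "n"})   # puts columns in R's order
trailing = ProjectNode({"cyl": "cyl", "n": "n"})  # repeats it, so it is a no-op
kept = prune(["n", "cyl"], [reorder, trailing])
print(len(kept))  # prints 1
```

      Comparing each projection against the schema it receives, rather than against the previous node's printed projection, avoids wrongly dropping a repeated computed expression.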

            People

              Assignee: Neal Richardson (npr)
              Reporter: Neal Richardson (npr)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h