Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-15726

[C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 8.0.0
    • C++

    Description

      The following query should read a single column from the target parquet file.

      open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% collect()
      

      Furthermore, it should apply a pushdown filter to the source node allowing parquet row groups to potentially filter out target data.

      At the moment it creates the following exec plan:

      3:SinkNode{}
        2:ProjectNode{projection=[l_tax]}
          1:FilterNode{filter=(l_tax < 0.01)}
            0:SourceNode{}
      

      There is no projection or filter in the source node. As a result we end up reading much more data from disk (the entire file) than we need to (at most a single column).

      This could be fixed via heuristics in the dplyr code. However, it may quickly get complex (for example, the project comes after the filter, so you need to make sure you push down a projection that includes both the columns accessed by the filter and the columns accessed by the projection OR can you push down the projection through a join [yes you can], how do you know which columns to apply to which source node).

      A more complete fix would be to call into some kind of 3rd party optimizer (e.g. calcite) after the plan has been created by dplyr but before it is passed to the execution engine.

      Attachments

        Issue Links

          Activity

            People

              westonpace Weston Pace
              westonpace Weston Pace
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h
                  2h