Apache Arrow / ARROW-9678

[Rust] [DataFusion] Improve projection push down to remove unused columns

Details

    Description

      Currently, projection push-down only removes columns that are never referenced anywhere in the plan. However, a projection sometimes declares columns that no later node in the plan uses.

      This issue is about improving the projection push-down to remove any column that is not logically required by the plan.
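
      One way to picture the improvement is the minimal sketch below. It is not DataFusion's implementation: the `Plan` enum and the functions `prune`, `optimize`, `output_columns`, `cols` and `issue_example` are hypothetical names introduced only for illustration. The idea is to walk the plan from the root while carrying the set of columns that the nodes above actually need; each projection drops any declared column outside that set, and the scan reads only what is left.

```rust
use std::collections::BTreeSet;

// Toy logical plan for illustration only; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    // Table scan; `projection` holds the indices of the columns to read.
    Scan { columns: Vec<String>, projection: Option<Vec<usize>> },
    // Row filter whose predicate references `predicate_columns`.
    Filter { predicate_columns: Vec<String>, input: Box<Plan> },
    // Projection that outputs the named columns.
    Project { columns: Vec<String>, input: Box<Plan> },
}

// Columns produced by a plan node.
fn output_columns(plan: &Plan) -> BTreeSet<String> {
    match plan {
        Plan::Scan { columns, .. } | Plan::Project { columns, .. } => {
            columns.iter().cloned().collect()
        }
        Plan::Filter { input, .. } => output_columns(input),
    }
}

// Rewrites `plan` so that each node only produces columns in `required`.
fn prune(plan: &Plan, required: &BTreeSet<String>) -> Plan {
    match plan {
        Plan::Project { columns, input } => {
            // Drop declared columns that no parent node needs.
            let kept: Vec<String> = columns
                .iter()
                .filter(|c| required.contains(*c))
                .cloned()
                .collect();
            let child_required: BTreeSet<String> = kept.iter().cloned().collect();
            Plan::Project { columns: kept, input: Box::new(prune(input, &child_required)) }
        }
        Plan::Filter { predicate_columns, input } => {
            // The filter needs its parents' columns plus those in its predicate.
            let mut child_required = required.clone();
            child_required.extend(predicate_columns.iter().cloned());
            Plan::Filter {
                predicate_columns: predicate_columns.clone(),
                input: Box::new(prune(input, &child_required)),
            }
        }
        Plan::Scan { columns, .. } => {
            // Read only the required columns, identified by index.
            let projection: Vec<usize> = columns
                .iter()
                .enumerate()
                .filter(|(_, c)| required.contains(*c))
                .map(|(i, _)| i)
                .collect();
            Plan::Scan { columns: columns.clone(), projection: Some(projection) }
        }
    }
}

// Entry point: every column output by the root is required.
fn optimize(plan: &Plan) -> Plan {
    prune(plan, &output_columns(plan))
}

// Small helper to build owned column names.
fn cols(names: &[&str]) -> Vec<String> {
    names.iter().map(|n| n.to_string()).collect()
}

// The plan from this issue: project(c, a, b) -> filter(c > 1) -> project(c, a).
fn issue_example() -> Plan {
    let scan = Plan::Scan { columns: cols(&["a", "b", "c"]), projection: None };
    let first = Plan::Project { columns: cols(&["c", "a", "b"]), input: Box::new(scan) };
    let filter = Plan::Filter { predicate_columns: cols(&["c"]), input: Box::new(first) };
    Plan::Project { columns: cols(&["c", "a"]), input: Box::new(filter) }
}
```

      Calling `optimize(&issue_example())` on this toy model removes "b" from the inner projection and leaves the scan reading only column indices 0 and 2.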

      A failing unit test that illustrates the idea:

          #[test]
          fn table_unused_column() -> Result<()> {
              let table_scan = test_table_scan()?;
              assert_eq!(3, table_scan.schema().fields().len());
              assert_fields_eq(&table_scan, vec!["a", "b", "c"]);
      
              // "b" is declared in the first projection but never used afterwards => remove it
              let plan = LogicalPlanBuilder::from(&table_scan)
                  .project(vec![col("c"), col("a"), col("b")])?
                  .filter(col("c").gt(&lit(1)))?
                  .project(vec![col("c"), col("a")])?
                  .build()?;
      
              assert_fields_eq(&plan, vec!["c", "a"]);
      
              let expected = "\
              Projection: #c, #a\
              \n  Selection: #c Gt Int32(1)\
              \n    Projection: #c, #a\
              \n      TableScan: test projection=Some([0, 2])";
      
              assert_optimized_plan_eq(&plan, expected);
      
              Ok(())
          }
      

      This issue was first identified by andygrove here.

            People

              jorgecarleitao Jorge Leitão

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h