Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-9516

[Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

    XMLWordPrintableJSON

Details

    Description

      This issue covers three main topics that IMO are addressed as a whole in a refactor of the physical plans and expressions in data fusion. The underlying issues that justify this particular ticket:

      We currently assign poor names to the output schema.

      Specifically, most names are given based on the last expression's name. Example: SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c yields the fields names "c, SUM, SUM".

      We currently derive the column names from physical expressions, not logical expressions

      This implies that logical expressions that perform multiple operations (e.g. an grouped aggregation that performs partitioned aggregations + merge + final aggregation) have their name derived from their physical declaration, not logical. IMO a physical plan is an execution plan and is thus not concerned with naming. It is the logical plan that should be concerned with naming. Conceptually, a given logical plan can have more than one physical plan, e.g. depending on the execution environment (e.g. locally vs distributed).

      We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers.

      More details here. In summary, it is possible to remove one of the optimizers and significantly simplify the other if columns do not carry indexing information.

      Proposal

      I propose that we:

      drop physical_plan::expressions::Column::index

      This is a major simplification of the code, and allow us to just ignore the position of the statement on the schema, and instead focus on its name. This is overall a simplification because it allow us to treat columns based solely on their names, and not on their position in the schema. Since SQL does not care about the position of the column on the table anyway (we currently already take the first column with that name), this seems natural.

      I already prototyped this here.

      The main conclusion of this prototype is that this feasible as long as all our expressions get assigned a unique name, which is against what we currently offer (see example above). This leads me to:

      drop physical_plan::PhysicalExpr::name()

      Currently, the name of an expression is derived from its physical plan. However, some operations' names are required to be known before its physical representation. The example I found in our current code is the grouped aggregation described above. If we were to build the name of our aggregation based on its physical plan, the name of a "COUNT(a)" operation would be SUM(COUNT(a)) because, in the physical plan we first count on each partition, then merge, and them sum the counts over all partitions.

      Fundamentally, IMO the issue here is that we are mixing responsibilities: the physical plan should not care about naming, because the physical plan corresponds to an execution plan, not a logical description of the column (its name). This leads me to:

      add logicalplan::Expr::name(&self, input_schema: &Schema)

      This will rerturn the name of this expression, that will naturally depend on its variation. Its implementation will be based on our current code for physical_plan::PhysicalExpr::name().

      I can take this work, but before committing, would like to know your thoughts about this. My initial prototyping indicate that all of this is possible and greatly simplifies the code, but I may be missing a design aspect of this.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jorgecarleitao Jorge Leitão
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h 40m
                  4h 40m