[ARROW-9516] [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Rust - DataFusion
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25584

Description

This issue covers three main topics that IMO are addressed as a whole in a refactor of the physical plans and expressions in data fusion. The underlying issues that justify this particular ticket:

We currently assign poor names to the output schema.

Specifically, most names are given based on the last expression's name. Example: SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c yields the fields names "c, SUM, SUM".

We currently derive the column names from physical expressions, not logical expressions

This implies that logical expressions that perform multiple operations (e.g. an grouped aggregation that performs partitioned aggregations + merge + final aggregation) have their name derived from their physical declaration, not logical. IMO a physical plan is an execution plan and is thus not concerned with naming. It is the logical plan that should be concerned with naming. Conceptually, a given logical plan can have more than one physical plan, e.g. depending on the execution environment (e.g. locally vs distributed).

We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers.

More details here. In summary, it is possible to remove one of the optimizers and significantly simplify the other if columns do not carry indexing information.

Proposal

I propose that we:

drop `physical_plan::expressions::Column::index`

This is a major simplification of the code, and allow us to just ignore the position of the statement on the schema, and instead focus on its name. This is overall a simplification because it allow us to treat columns based solely on their names, and not on their position in the schema. Since SQL does not care about the position of the column on the table anyway (we currently already take the first column with that name), this seems natural.

I already prototyped this here.

The main conclusion of this prototype is that this feasible as long as all our expressions get assigned a unique name, which is against what we currently offer (see example above). This leads me to:

drop `physical_plan::PhysicalExpr::name()`

Currently, the name of an expression is derived from its physical plan. However, some operations' names are required to be known before its physical representation. The example I found in our current code is the grouped aggregation described above. If we were to build the name of our aggregation based on its physical plan, the name of a "COUNT(a)" operation would be SUM(COUNT(a)) because, in the physical plan we first count on each partition, then merge, and them sum the counts over all partitions.

Fundamentally, IMO the issue here is that we are mixing responsibilities: the physical plan should not care about naming, because the physical plan corresponds to an execution plan, not a logical description of the column (its name). This leads me to:

add `logicalplan::Expr::name(&self, input_schema: &Schema)`

This will rerturn the name of this expression, that will naturally depend on its variation. Its implementation will be based on our current code for physical_plan::PhysicalExpr::name().

I can take this work, but before committing, would like to know your thoughts about this. My initial prototyping indicate that all of this is possible and greatly simplifies the code, but I may be missing a design aspect of this.

Attachments

Issue Links

links to

GitHub Pull Request #7796

Activity

People

Assignee:: Unassigned

Reporter:: Jorge Leitão

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 17/Jul/20 21:39

Updated:: 11/Jan/23 11:07

Resolved:: 25/Jul/20 03:19

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

[Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

Details

Description

We currently assign poor names to the output schema.

We currently derive the column names from physical expressions, not logical expressions

We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers.

Proposal

drop `physical_plan::expressions::Column::index`

drop `physical_plan::PhysicalExpr::name()`

add `logicalplan::Expr::name(&self, input_schema: &Schema)`

Attachments

Issue Links

Activity

People

Dates

Time Tracking

Agile

Slack

Issue deployment

Details

Description

We currently assign poor names to the output schema.

We currently derive the column names from physical expressions, not logical expressions

We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers.

Proposal

drop physical_plan::expressions::Column::index

drop physical_plan::PhysicalExpr::name()

add logicalplan::Expr::name(&self, input_schema: &Schema)

Attachments

Issue Links

Activity

People

Dates

Time Tracking

Agile

Slack

Issue deployment

drop `physical_plan::expressions::Column::index`

drop `physical_plan::PhysicalExpr::name()`

add `logicalplan::Expr::name(&self, input_schema: &Schema)`