The goal of ARROW-16860 and in general one of the goals of the Substrait consumer effort thus far, is to enable round-tripping between Substrait and Acero plans. However, this begs the question what constitutes round-tripping: are we talking about a perfect reproduction of a Substrait plan after converting it to and from Acero (and/or vice-versa?), or we just talking about functionally-equivalent plans, or is it something in between?
This is kind of a rhethorical question because I think it depends on the use case. We've been doing the former thus far to help prove correctness, but this has various problems. For example:
- Substrait plans contain meaningless information that cannot be represented in Acero, such as the order in which extensions are defined or the anchors used to refer to them. Plans are functionally and structurally indistinguishable even if this information is lost.
- Protobuf itself also contains meaningless information, because the order in which fields are defined on the wire is undefined, and not even consistent between serializations (hence the existence of CheckMessagesEquivalent).
- (I'm guessing) Acero plans also contain functionally meaningless information (like intermediate column names) that Substrait cannot represent, at least not without advanced extensions.
- The Substrait and Arrow type systems are quite different; tracking the conversion between them in a way that loses no (meta-)information is difficult. For example, Acero always encodes field names in schemas, while Substrait only does this at the input and output.
- Substrait and Acero deal with projections and expressions in fundamentally different ways (see ARROW-16986).
The approach thus far has been to just reject an incoming plan if it contains something that can't be round-tripped exactly (at least according to CheckMessagesEquivalent), but this behavior is far too pedantic to be useful in practice, since it rejects perfectly valid and executable plans. For example, optimizations (hints) specified in advanced extensions can be freely ignored but are currently rejected.
Rather than trying to answer this question, I'd suggest adding a method to specify conversion options. Initially I suggest a single enum with the following variants:
- pedantic conversion: reject plans that are known to not round-trip even if they are valid.
- structure-preserving conversion: accept plans even if they won't round-trip, but preserve the relation structure of the incoming plan completely.
- best-effort conversion: accept plans even if they won't round-trip, and avoid regressions in terms of optimality of the plan caused by the conversion, potentially changing the relation structure, thus allowing for "optimizations" like ARROW-16986.
The enum should be in an options struct though, so more options can be added later without having to add more arguments to the conversion functions (see also ARROW-16987).