Currently, when a SelectionVectorRemover receives a record batch from an upstream operator (like a Filter), it immediately starts copying over records into a new outgoing batch.
It could be worthwhile to enrich the RecordBatch with some additional summary statistics about the attached SelectionVector, such as
- the number of records that need to be removed/copied
- the total number of records in the record batch
The benefit would show in the extreme cases: if every record in a batch is deselected, the SelectionVectorRemover can simply drop the record batch, and if every record is selected, it can forward the batch unchanged to the next downstream operator.
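The decision logic described above can be sketched as follows. This is a minimal illustration, not Drill's actual API: the class and enum names (`SvrFastPath`, `Decision`) and the idea of passing the two counts directly as integers are assumptions for the example.

```java
// Hypothetical sketch of the SVR fast path, driven by the two proposed
// summary statistics: total records in the batch and records selected
// by the SelectionVector. Names here are illustrative, not Drill APIs.
public class SvrFastPath {
    enum Decision { DROP, PASS_THROUGH, COPY }

    static Decision decide(int totalRecords, int selectedRecords) {
        if (selectedRecords == 0) {
            // Nothing survived the upstream filter: drop the batch outright.
            return Decision.DROP;
        }
        if (selectedRecords == totalRecords) {
            // The selection vector is a no-op: forward the batch unchanged.
            return Decision.PASS_THROUGH;
        }
        // Partial selection: a new outgoing batch must be materialized.
        return Decision.COPY;
    }

    public static void main(String[] args) {
        System.out.println(decide(1024, 0));    // DROP
        System.out.println(decide(1024, 1024)); // PASS_THROUGH
        System.out.println(decide(1024, 512));  // COPY
    }
}
```

Both fast paths avoid allocating and populating a new batch; only the partial-selection case pays the copy cost.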
While the extreme case of simply dropping the batch mostly works today (there is little overhead when nothing is copied), for batches that should pass through unchanged the copying overhead remains, and it accounts for more than 35% of the time once the streaming-aggregation cost within the tests is discounted.
Here are the statistics for such an optimization:
| Selectivity | Query Time | % Time used by SVR | Time | Profile |
|---|---|---|---|---|
To summarize, the SVR should avoid creating new batches as much as possible.
A more generic (non-trivial) optimization should also account for the fact that multiple emitted batches can be coalesced, but we do not currently have test metrics for that.