Today, we fully materialize a row before evaluating the top-level conjuncts when scanning Parquet. This includes materializing nested collections. We should avoid materializing nested collections if top-level conjuncts already discard the row. Our recent move to column-wise materialization makes this improvement feasible (
To illustrate the problem, consider this query:
Even though we have a very selective predicate on the top-level customer, our scanner will still fully materialize all orders of all customers. The non-matches will be filtered, but we still pay the cost of materializing the orders.
The proposed improvement is to avoid materializing the orders of non-qualifying customers.
The improvement will several things:
- Analyze and separate the top-level conjuncts into those that can be evaluated before materializing the nested collections and those that require nested collections to be materialized. In particular, we need to be careful with our auto-generated !empty() predicates on nested collections.
- Add a new SkipValues() or similar interface to the Parquet column readers to advances the scanner without actually materializing values. If possible, we should skip entire blocks.