IMPALA-3841

Avoid materializing nested collections if top-level predicates already disqualify the row.

Details

    Description

      Today, we fully materialize a row before evaluating the top-level conjuncts when scanning Parquet. This includes materializing nested collections. We should avoid materializing nested collections if top-level conjuncts already discard the row. Our recent move to column-wise materialization makes this improvement feasible (IMPALA-2736).

      To illustrate the problem, consider this query:

      select * from customer c, c.orders o where c.id = 10
      

      Even though we have a very selective predicate on the top-level customer, our scanner will still fully materialize all orders of all customers. The non-matches will be filtered, but we still pay the cost of materializing the orders.

      The proposed improvement is to avoid materializing the orders of non-qualifying customers.
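The difference between the two behaviors can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Impala's scanner code: `Customer`, `Order`, `ScanStats`, `MakeTable`, `ScanEager`, and `ScanLazy` are hypothetical names, and copying the `orders` vector stands in for the cost of materializing a nested collection.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for a nested schema; these are not Impala types.
struct Order { int64_t order_id; };
struct Customer {
  int64_t id;
  std::vector<Order> orders;  // the nested collection
};

struct ScanStats {
  int64_t rows_returned = 0;        // one result row per (customer, order) pair
  int64_t orders_materialized = 0;  // proxy for materialization cost
};

// Builds customers with ids 0..num_customers-1, each with the same order count.
std::vector<Customer> MakeTable(int64_t num_customers, int64_t orders_each) {
  assert(num_customers >= 0 && orders_each >= 0);
  std::vector<Customer> table;
  for (int64_t id = 0; id < num_customers; ++id) {
    Customer c{id, {}};
    for (int64_t o = 0; o < orders_each; ++o) c.orders.push_back(Order{o});
    table.push_back(c);
  }
  return table;
}

// Behavior described above: materialize the full row, then filter.
ScanStats ScanEager(const std::vector<Customer>& table, int64_t wanted_id) {
  ScanStats stats;
  for (const Customer& c : table) {
    std::vector<Order> materialized(c.orders);  // cost is paid for every customer
    stats.orders_materialized += static_cast<int64_t>(materialized.size());
    if (c.id == wanted_id) {
      stats.rows_returned += static_cast<int64_t>(materialized.size());
    }
  }
  return stats;
}

// Proposed behavior: evaluate the top-level predicate first and only
// materialize the collection for qualifying customers.
ScanStats ScanLazy(const std::vector<Customer>& table, int64_t wanted_id) {
  ScanStats stats;
  for (const Customer& c : table) {
    // A real scanner would still have to advance the nested column readers
    // past this row; the toy model has no reader state to advance.
    if (c.id != wanted_id) continue;
    std::vector<Order> materialized(c.orders);
    stats.orders_materialized += static_cast<int64_t>(materialized.size());
    stats.rows_returned += static_cast<int64_t>(materialized.size());
  }
  return stats;
}
```

Both functions return the same rows; in this model only the number of materialized orders differs, which is the cost the issue proposes to avoid.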

      The improvement will require several things:

      • Analyze and separate the top-level conjuncts into those that can be evaluated before materializing the nested collections and those that require nested collections to be materialized. In particular, we need to be careful with our auto-generated !empty() predicates on nested collections.
      • Add a new SkipValues() or similar interface to the Parquet column readers to advance the scanner without actually materializing values. If possible, we should skip entire blocks.
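The two items above could look roughly like the following. This is a minimal sketch under heavy assumptions, not a proposed Impala interface: `Conjunct`, its `needs_collection` flag, `SplitConjuncts`, and `ToyColumnReader` are hypothetical, blocks are plain in-memory vectors rather than encoded Parquet pages, and repetition/definition levels are ignored, so one value corresponds to one skippable unit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical conjunct: the flag marks predicates that reference a nested
// collection, such as an auto-generated !empty() predicate.
struct Conjunct {
  std::string text;
  bool needs_collection;
};

struct ConjunctSplit {
  std::vector<Conjunct> before_collections;  // safe to evaluate early
  std::vector<Conjunct> after_collections;   // need the collection first
};

ConjunctSplit SplitConjuncts(const std::vector<Conjunct>& conjuncts) {
  ConjunctSplit split;
  for (const Conjunct& c : conjuncts) {
    (c.needs_collection ? split.after_collections : split.before_collections)
        .push_back(c);
  }
  return split;
}

// Toy column reader over pre-decoded blocks of values.
class ToyColumnReader {
 public:
  explicit ToyColumnReader(std::vector<std::vector<int64_t>> blocks)
      : blocks_(std::move(blocks)) {}

  // Reads ("materializes") the next value; returns false at end of column.
  bool ReadValue(int64_t* out) {
    if (!AdvanceToNonEmptyBlock()) return false;
    *out = blocks_[block_idx_][pos_++];
    ++values_decoded_;
    return true;
  }

  // Advances past up to n values without decoding them and returns how many
  // were skipped. A block that is skipped from its first value to its last is
  // counted as skipped whole; a real reader could avoid decompressing it.
  int64_t SkipValues(int64_t n) {
    assert(n >= 0);
    int64_t skipped = 0;
    while (skipped < n && AdvanceToNonEmptyBlock()) {
      const int64_t remaining =
          static_cast<int64_t>(blocks_[block_idx_].size() - pos_);
      const int64_t step = std::min(n - skipped, remaining);
      if (pos_ == 0 && step == remaining) ++blocks_skipped_whole_;
      pos_ += static_cast<std::size_t>(step);
      skipped += step;
    }
    return skipped;
  }

  int64_t values_decoded() const { return values_decoded_; }
  int64_t blocks_skipped_whole() const { return blocks_skipped_whole_; }

 private:
  // Moves to the next block that still has unread values, if any.
  bool AdvanceToNonEmptyBlock() {
    while (block_idx_ < blocks_.size() && pos_ >= blocks_[block_idx_].size()) {
      ++block_idx_;
      pos_ = 0;
    }
    return block_idx_ < blocks_.size();
  }

  std::vector<std::vector<int64_t>> blocks_;
  std::size_t block_idx_ = 0;
  std::size_t pos_ = 0;
  int64_t values_decoded_ = 0;
  int64_t blocks_skipped_whole_ = 0;
};
```

In a scan loop, conjuncts in `before_collections` would run first, and a failing row would trigger `SkipValues()` on the nested column readers instead of `ReadValue()`. How many values to skip per row depends on the repetition levels in a real Parquet file, which this sketch does not model.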

            People

              Assignee: Unassigned
              Reporter: Alexander Behm (alex.behm)