IMPALA-2663

Filter out tuples with empty collection slots in scan.


Details

    Description

      For queries that reference nested collections, we assign predicates on fields of a nested collection directly to the corresponding parent scan. Those predicates determine how many items are materialized inside the collection-typed slot. There is a good chance that such predicates leave the collection empty, but the scan currently does not filter out the containing tuple when that happens. Consider this query and its plan:

      Query:
      select c_custkey, o_orderkey
      from tpch_nested_parquet.customer c, c.c_orders
      where o_orderkey = 1884930
      
      Plan:
      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=176.00MB VCores=1                          |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | tpch_nested_parquet.customer                                                       |
      |                                                                                    |
      | 05:EXCHANGE [UNPARTITIONED]                                                        |
      | |                                                                                  |
      | 01:SUBPLAN                                                                         |
      | |                                                                                  |
      | |--04:NESTED LOOP JOIN [CROSS JOIN]                                                |
      | |  |                                                                               |
      | |  |--02:SINGULAR ROW SRC                                                          |
      | |  |                                                                               |
      | |  03:UNNEST [c.c_orders]                                                          |
      | |                                                                                  |
      | 00:SCAN HDFS [tpch_nested_parquet.customer c]                                      |
      |    partitions=1/1 files=4 size=554.13MB                                            |
      |    predicates on c_orders: o_orderkey = 1884930                                    |
      +------------------------------------------------------------------------------------+
      

      In the query above, the scan returns one row for every customer, although only a few customer rows can produce any output from the subplan. Avoiding subplan iterations is important because each one is expensive even when unnesting an empty collection (due to the Reset() of the plan tree).
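To make the effect concrete, here is a minimal Python sketch of the idea. It is a toy model, not Impala code: the `scan` and `subplan` functions, the `filter_empty` flag, and the three sample customers are all hypothetical names and data chosen for illustration. It assumes the inner-join semantics of the query above, where a customer with no matching orders contributes nothing to the result.

```python
# Toy model of the scan + subplan from the plan above. Not Impala code;
# every name and every sample row here is a hypothetical illustration.

def scan(customers, pred, filter_empty):
    """Materialize c_orders with pred applied to its items.

    With filter_empty=True, tuples whose collection ends up empty are dropped.
    """
    out = []
    for c in customers:
        orders = [o for o in c["c_orders"] if pred(o)]
        if filter_empty and not orders:
            continue  # proposed behavior: do not return this tuple at all
        out.append({"c_custkey": c["c_custkey"], "c_orders": orders})
    return out

def subplan(rows):
    """Inner-join unnest: one subplan iteration per row returned by the scan."""
    iterations = 0
    result = []
    for row in rows:
        iterations += 1  # stands in for the per-row cost of the subplan
        for o in row["c_orders"]:
            result.append((row["c_custkey"], o["o_orderkey"]))
    return result, iterations

customers = [
    {"c_custkey": 1, "c_orders": [{"o_orderkey": 10}, {"o_orderkey": 11}]},
    {"c_custkey": 2, "c_orders": [{"o_orderkey": 1884930}]},
    {"c_custkey": 3, "c_orders": []},
]

def pred(o):
    return o["o_orderkey"] == 1884930

unfiltered, iters_before = subplan(scan(customers, pred, filter_empty=False))
filtered, iters_after = subplan(scan(customers, pred, filter_empty=True))
```

In this model both runs produce the same result rows, but the filtered scan triggers one subplan iteration instead of three.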

      For nested TPCH Q12 on my dev setup this improvement reduced the number of rows returned from the scan by >100x, from 1.5M to 13K.

      It seems that the following nested TPCH queries could also benefit from this improvement:
      Q3, Q4, Q5, Q7, Q8, Q10, Q12, Q13, Q21

      This optimization needs special care for outer- or semi-joined nested collections, because filtering out tuples with empty collections may not be correct in those cases.
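The outer-join concern can be illustrated with a similar toy model. This sketch is not Impala code, and `scan`, `left_outer_unnest`, and the sample data are hypothetical. It assumes standard left outer join semantics, where a parent tuple with an empty collection must still produce one NULL-padded row, and it assumes the predicate is still applied to the collection items.

```python
# Toy model of why dropping tuples with empty collections can be wrong for
# an outer-joined collection. Not Impala code; all names are hypothetical.

def scan(customers, pred, filter_empty):
    """Materialize c_orders with pred applied; optionally drop empty tuples."""
    out = []
    for c in customers:
        orders = [o for o in c["c_orders"] if pred(o)]
        if filter_empty and not orders:
            continue
        out.append({"c_custkey": c["c_custkey"], "c_orders": orders})
    return out

def left_outer_unnest(rows):
    """Left outer join semantics: an empty collection yields a NULL-padded row."""
    result = []
    for row in rows:
        if not row["c_orders"]:
            result.append((row["c_custkey"], None))
        for o in row["c_orders"]:
            result.append((row["c_custkey"], o["o_orderkey"]))
    return result

customers = [
    {"c_custkey": 1, "c_orders": [{"o_orderkey": 10}]},
    {"c_custkey": 2, "c_orders": [{"o_orderkey": 1884930}]},
    {"c_custkey": 3, "c_orders": []},
]

def pred(o):
    return o["o_orderkey"] == 1884930

expected = left_outer_unnest(scan(customers, pred, filter_empty=False))
with_filter = left_outer_unnest(scan(customers, pred, filter_empty=True))
```

Here `expected` keeps customers 1 and 3 as NULL-padded rows, while `with_filter` loses them, so under these assumptions the scan-side filter changes the query result.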

          People

            alex.behm Alexander Behm