IMPALA-12395: Planner overestimates scan cardinality for queries using count star optimization


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.3.0
    • Component/s: fe
    • Labels: None

    Description

      The scan cardinality estimate for count queries does not account for the count star optimization, which reads only file metadata instead of scanning the actual columns.

      Scan node from the exec summary of a count query on Parquet store_sales (2.71K rows actually produced versus an estimate of 8.64B):

       

      Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
      -----------------------------------------------------------------------------------------------------------------------------------------------------
      00:SCAN S3       6     72  8s131ms   8s496ms   2.71K       8.64B  128.00 KB       88.00 MB  tpcds_3000_string_parquet_managed.store_sales
      

       

      This is a problem with all file/table formats that implement count optimizations (Parquet and also probably ORC and Iceberg).

      This problem is more serious than it was in the past: since IMPALA-12091, we rely on scan cardinality estimates for executor group assignment, so count queries are likely to be assigned to a larger executor group than they need.
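
      The mismatch described above can be illustrated with a small sketch. This is not Impala's planner code and not necessarily how the fix was implemented. It assumes the optimized scan emits roughly one partial count per Parquet row group, and every name in it (ScanStats, estimate_scan_cardinality, row_groups_per_file) is hypothetical. The file count in the usage lines is invented, chosen only so the result lands near the 2.71K rows in the summary above.

```python
import math
from dataclasses import dataclass


@dataclass
class ScanStats:
    """Hypothetical planner inputs for one scan node (illustrative names)."""
    table_row_count: int        # row count taken from table statistics
    num_files: int              # data files remaining after partition pruning
    row_groups_per_file: float  # assumed average row groups per file


def estimate_scan_cardinality(stats: ScanStats, count_star_optimized: bool) -> int:
    """Estimate how many rows the scan node emits to its parent."""
    if not count_star_optimized:
        # A regular scan materializes every row of the table.
        return stats.table_row_count
    # Assumption: the optimized scan reads footer metadata only and emits
    # one partial count per row group, never more than the table row count.
    metadata_rows = math.ceil(stats.num_files * stats.row_groups_per_file)
    return min(stats.table_row_count, metadata_rows)


# Illustrative numbers only: 8.64B is the rounded estimate from the summary,
# and the file count is made up for this example.
stats = ScanStats(table_row_count=8_640_000_000, num_files=2_710,
                  row_groups_per_file=1.0)
unoptimized = estimate_scan_cardinality(stats, count_star_optimized=False)
optimized = estimate_scan_cardinality(stats, count_star_optimized=True)
print(f"regular scan estimate: {unoptimized}, count star estimate: {optimized}")
```

      With an estimate of this kind the scan cardinality drops by several orders of magnitude, which is the sort of difference that would matter when the estimate feeds executor group selection.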


          People

            Assignee: Riza Suminto (rizaon)
            Reporter: David Rorke (drorke)