Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Backend

    Description

      select count(*) from parquet_table;
      select count(*) from parquet_table group by partition_col;
      

      Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but performance can be significantly improved with a combined plan and execution change, as follows:

      Execution change
      Instead of returning empty batches until num_rows rows have been returned, the Parquet scanner can populate a single slot with the num_rows value from the Parquet row group metadata.
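
      The execution change can be illustrated with a small Python simulation. This is only a sketch, not Impala's scanner code: the RowGroup class, the function names, the batch size of 1024, and the choice to emit one value per row group are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RowGroup:
    # Hypothetical stand-in for the row count stored in Parquet row group metadata.
    num_rows: int


def scan_empty_batches(row_groups: List[RowGroup], batch_size: int = 1024) -> Iterator[int]:
    """Simplified model of the existing path: emit empty batches (represented
    here only by their row counts) until num_rows rows have been returned."""
    for rg in row_groups:
        remaining = rg.num_rows
        while remaining > 0:
            n = min(batch_size, remaining)
            yield n
            remaining -= n


def scan_num_rows_slot(row_groups: List[RowGroup]) -> Iterator[int]:
    """Simplified model of the proposed path: emit one row per row group whose
    single slot holds that row group's num_rows (an assumed granularity)."""
    for rg in row_groups:
        yield rg.num_rows


row_groups = [RowGroup(5000), RowGroup(1200), RowGroup(0)]
old_batches = list(scan_empty_batches(row_groups))
new_rows = list(scan_num_rows_slot(row_groups))

# Both paths account for the same number of rows, but the proposed path
# produces far fewer units of work for the aggregation above the scan.
print(sum(old_batches), len(old_batches))
print(sum(new_rows), len(new_rows))
```

      In this model the work done above the scan grows with the number of row groups rather than with the number of rows.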

      Plan change
      The count local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
      The final distributed plan will be:
      scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))
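
      The plan shape above can be sketched in Python as a two-level sum. This is a minimal illustration, not the planner's implementation: the node names, the sample values, and the functions local_agg and merge_agg are hypothetical.

```python
from typing import Dict, List


def local_agg(num_rows_slots: List[int]) -> int:
    """Local aggregation on one node: sum(num_rows_slot) over the scan output."""
    # Python's sum() of an empty list is 0.
    return sum(num_rows_slots)


def merge_agg(partial_sums: List[int]) -> int:
    """Merge aggregation: sum(sum(num_rows_slot)) over the per-node partial sums."""
    return sum(partial_sums)


# Hypothetical scan output per node: one num_rows value per scanned row group.
scans_by_node: Dict[str, List[int]] = {
    "node1": [5000, 1200],
    "node2": [300],
    "node3": [],
}

partials = [local_agg(slots) for slots in scans_by_node.values()]
total = merge_agg(partials)
print(partials, total)
```

      The merge step is unchanged in spirit from a distributed count, which also sums partial results; only the local step switches from counting rows to summing the num_rows slot.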

      This optimization applies when the query contains only a count and there are no scan predicates.

            People

              tarasbob Taras Bobrovytsky
              alex.behm Alexander Behm