Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Backend

      Description

      select count(*) from parquet_table;
      select count(*) from parquet_table group by partition_col;
      

      Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows:

      Execution change
      Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups

      Plan change
      The count local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
      The final distributed plan will be:
      scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))

      This optimization is applicable where is only a count and there are no scan predicates.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                tarasbob Taras Bobrovytsky
                Reporter:
                alex.behm Alexander Behm
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: