Attach filesAttach ScreenshotVotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Impala 2.10.0
    • Backend

    Description

      select count(*) from parquet_table;
      select count(*) from parquet_table group by partition_col;
      

      Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows:

      Execution change
      Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups

      Plan change
      The count local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
      The final distributed plan will be:
      scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))

      This optimization is applicable where is only a count and there are no scan predicates.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            tarasbob Taras Bobrovytsky
            alex.behm Alexander Behm
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment