ARROW-11840: [C++][Compute] Support merging GroupByState for multithreaded aggregation

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 3.0.0
    • None
    • C++

    Description

      ARROW-11591 adds support for grouped aggregation but defers merging, which is non-trivial and unnecessary for single-threaded aggregation. Merging will eventually need to be supported, however: when aggregating in a multithreaded dataset scan, each thread's results must be combined after the scan completes.
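
The combining step described above can be sketched as follows. This is a minimal illustration, not Arrow's implementation: `GroupedSumState`, its `Consume` and `Merge` methods, and `MergeAll` are hypothetical names, and a grouped sum keyed by string stands in for the real `GroupByState`. Each element passed to `MergeAll` represents the partial result that one scan thread produced on its own.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a grouped aggregation state: one running sum per key.
// This is NOT Arrow's GroupByState; it only illustrates the merge step.
struct GroupedSumState {
  std::unordered_map<std::string, int64_t> sums;

  // Fold one (key, value) row into this state.
  void Consume(const std::string& key, int64_t value) { sums[key] += value; }

  // Combine another thread's partial result into this one.
  // The cost grows with the number of groups held by `other`.
  void Merge(const GroupedSumState& other) {
    for (const auto& entry : other.sums) {
      sums[entry.first] += entry.second;
    }
  }
};

// Merge the per-thread partial states once, after every thread has finished.
GroupedSumState MergeAll(const std::vector<GroupedSumState>& partials) {
  GroupedSumState merged;
  for (const auto& partial : partials) {
    merged.Merge(partial);
  }
  return merged;
}
```

In this sketch, a group seen by only one thread is carried over unchanged and a group seen by several threads has its partial sums added, so the merged state holds one entry per distinct key.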

      Note that ScalarAggExecutor::Consume currently assumes that merging aggregation states is cheap (true for a small state such as that of "mean", but false for "group_by") and invokes ScalarAggregateKernel::merge for each input batch. ARROW-11591 introduces "group_by" as a special case that is not merged for each input batch, but ideally this assumption would not be made for any kernel. When removing it, be sure that merging of other aggregates continues to be tested.
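
To make the cost argument concrete, here is a hedged sketch of the two consumption strategies for a small state. All names (`MeanState`, `ConsumeWithMergePerBatch`, `ConsumeDirectly`) are hypothetical and do not mirror Arrow's actual classes. Returning 0.0 for an empty input in `Finalize` is also an assumption made only to keep the example short.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical "mean" state: constant size, so Merge is O(1).
struct MeanState {
  double sum = 0.0;
  int64_t count = 0;

  // Fold one batch of values into this state.
  void Consume(const std::vector<double>& batch) {
    for (double value : batch) {
      sum += value;
      ++count;
    }
  }

  // Combine another partial state into this one.
  void Merge(const MeanState& other) {
    sum += other.sum;
    count += other.count;
  }

  // Assumption for this sketch: an empty input yields 0.0.
  double Finalize() const {
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
  }
};

// Strategy 1: build a fresh state per batch and merge it into the running total.
// Cheap for a state like MeanState, costly when the state holds many groups.
MeanState ConsumeWithMergePerBatch(const std::vector<std::vector<double>>& batches) {
  MeanState total;
  for (const auto& batch : batches) {
    MeanState batch_state;
    batch_state.Consume(batch);
    total.Merge(batch_state);
  }
  return total;
}

// Strategy 2: one long-lived state consumes every batch, with no per-batch merge.
MeanState ConsumeDirectly(const std::vector<std::vector<double>>& batches) {
  MeanState total;
  for (const auto& batch : batches) {
    total.Consume(batch);
  }
  return total;
}
```

Both strategies should produce the same final state, which is the kind of equivalence a merge test for the remaining aggregates could check once the per-batch merge is no longer exercised implicitly.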

          People

            Assignee: Unassigned
            Reporter: Ben Kietzman (bkietz)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:
