SPARK-24947: aggregateAsync and foldAsync for RDD


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      AsyncRDDActions contains collectAsync, countAsync, foreachAsync, etc., but it doesn't provide a general mechanism for reducing a dataset asynchronously. If I want to aggregate some statistics over a large dataset and the job is going to take an hour, I shouldn't have to block a thread for that hour just to wait for the result.

       

      I propose the following methods be added to AsyncRDDActions:

       

      def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): FutureAction[U]

      def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]
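
      For example, with these methods a caller could kick off a long-running aggregation and register a callback instead of parking a thread on the result. The sketch below assumes an existing RDD[Long] named nums; aggregateAsync here is the proposed method, not an existing API:

      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.util.{Failure, Success}

      import org.apache.spark.FutureAction

      // Compute sum and count in a single pass; U = (Long, Long).
      val sumAndCount: FutureAction[(Long, Long)] =
        nums.aggregateAsync((0L, 0L))(
          (acc, x) => (acc._1 + x, acc._2 + 1L),  // seqOp: fold in one element
          (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge partial results

      // The calling thread is not blocked; the callback runs when the job finishes.
      sumAndCount.onComplete {
        case Success((sum, count)) => println(s"mean = ${sum.toDouble / count}")
        case Failure(e)            => e.printStackTrace()
      }

      Since FutureAction extends scala.concurrent.Future, the result can also be composed with map, flatMap, or awaited like any other Future.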

       

      Locally I have a version of aggregateAsync implemented on top of submitJob (similar to how countAsync is implemented), and a foldAsync implementation that simply delegates to aggregateAsync; a rough sketch follows below. I haven't written unit tests for these yet, but I can do so if this is a contribution that would be accepted. Please let me know.
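
      For reference, the sketch below shows roughly what such a submitJob-based implementation could look like. It is written as standalone helpers rather than as methods on AsyncRDDActions, and the object name and locking scheme are illustrative assumptions, not the actual patch:

      import scala.reflect.ClassTag

      import org.apache.spark.FutureAction
      import org.apache.spark.rdd.RDD

      // Rough sketch only; mirrors the submitJob pattern used by countAsync.
      object AsyncAggregateSketch {

        def aggregateAsync[T: ClassTag, U](rdd: RDD[T])(zeroValue: U)(
            seqOp: (U, T) => U, combOp: (U, U) => U): FutureAction[U] = {
          // Each task folds its partition with seqOp. (A production version
          // would clone zeroValue per task, as RDD.aggregate does.)
          val processPartition = (iter: Iterator[T]) => iter.foldLeft(zeroValue)(seqOp)

          // Merge partition results on the driver with combOp as they arrive;
          // the lock guards the mutable accumulator across result handlers.
          val lock = new Object
          var jobResult = zeroValue
          rdd.context.submitJob(
            rdd,
            processPartition,
            0 until rdd.partitions.length,
            (_: Int, partResult: U) => lock.synchronized {
              jobResult = combOp(jobResult, partResult)
            },
            lock.synchronized(jobResult)) // resultFunc: read once the job completes
        }

        // foldAsync just delegates, using the same function for both ops.
        def foldAsync[T: ClassTag](rdd: RDD[T])(zeroValue: T)(op: (T, T) => T): FutureAction[T] =
          aggregateAsync(rdd)(zeroValue)(op, op)
      }

      The driver-side merge mirrors how countAsync accumulates partition counts from its result handler, so no extra threads are needed beyond what the scheduler already uses.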

    People

      Assignee: Unassigned
      Reporter: Cody Allen (ceedubs)
      Votes: 1
      Watchers: 4
