Spark / SPARK-13700

RDD.mapAsync(): Easily mix Spark and asynchronous transformations


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      Spark is great for synchronous operations.

      But sometimes I need to call a database, a web server, or some other external service from within a transformation, and the Spark pipeline stalls while it waits for each response.

      Avoiding that would be great!

      I suggest we add a new method RDD.mapAsync(), which can execute these operations concurrently, avoiding the bottleneck.

      I've written a quick'n'dirty implementation of what I have in mind:
      https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
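
For illustration only (this is not the code from the gist), here is a minimal sketch of how such concurrent mapping could be approximated per partition with a thread pool. The names `map_async` and `max_in_flight` are hypothetical and not part of Spark; the sketch assumes the mapped function is blocking and I/O-bound, and that results should come back in input order.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def map_async(iterator, fn, max_in_flight=8):
    """Apply fn to each element concurrently, yielding results in input order.

    Hypothetical helper (not a Spark API). At most max_in_flight calls are
    pending at any time, so a large partition is never submitted all at once.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        pending = deque()
        for item in iterator:
            pending.append(pool.submit(fn, item))
            # Once the window is full, wait for the oldest call before reading more input.
            if len(pending) >= max_in_flight:
                yield pending.popleft().result()
        # Drain the calls that are still outstanding.
        while pending:
            yield pending.popleft().result()
```

With PySpark, a helper like this could be plugged into the existing `mapPartitions` method, for example `rdd.mapPartitions(lambda part: map_async(part, fetch))`, where `fetch` stands for whatever blocking call is needed. Threads are used here as a stand-in for true asynchronous I/O; an implementation based on futures or callbacks would follow the same windowing idea.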

      What do you think?

      If you agree with this feature, I can work on a pull request.


          People

            Assignee: Unassigned
            Reporter: Paulo Costa (paulo_raca)
            Votes: 0
            Watchers: 1

