SPARK-27992

PySpark socket server should sync with JVM connection thread future


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 3.0.0
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: PySpark

    Description

      Both SPARK-27805 and SPARK-27548 identified an issue where errors raised in a Spark job are not propagated to Python. This happens because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for those issues was to catch a SparkException if the job failed and send the exception through the PySpark serializer.
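
      As a minimal sketch of the failure mode (plain Python with hypothetical names, not the actual Spark code): an exception raised in a background thread is not re-raised in the thread that joins it, so the failure is silently lost unless it is explicitly caught and forwarded over the socket.

          import threading

          def serve_partitions():
              # Stands in for the serving thread that runs the Spark job after
              # the socket connection info has already been returned to Python.
              raise RuntimeError("job failed while serving results")

          server = threading.Thread(target=serve_partitions)
          server.start()
          server.join()  # returns normally; the RuntimeError is not re-raised
          # Execution continues here with no error beyond a traceback printed
          # to stderr by the default thread excepthook.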

      A better fix would be to allow Python to wait on the serving thread's future and join the thread. That way, if the serving thread throws an exception, it is propagated to Python on the call to awaitResult.
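
      A rough sketch of that pattern in plain Python (serve_partitions and the executor are hypothetical stand-ins; on the JVM side the wait corresponds to awaitResult on a Scala future): wrapping the serving work in a future and calling result() after the data has been consumed re-raises any serving-side exception in the caller.

          from concurrent.futures import ThreadPoolExecutor

          def serve_partitions():
              raise RuntimeError("job failed while serving results")

          try:
              with ThreadPoolExecutor(max_workers=1) as pool:
                  serving_future = pool.submit(serve_partitions)
                  # ... the client would consume rows from the socket here ...
                  serving_future.result()  # like awaitResult: re-raises the error
          except RuntimeError as e:
              print(f"error propagated to the caller: {e}")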


          People

            Assignee: Bryan Cutler (bryanc)
            Reporter: Bryan Cutler (bryanc)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
