SPARK-27992

PySpark socket server should sync with JVM connection thread future


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 3.0.0
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: PySpark

    Description

      Both SPARK-27805 and SPARK-27548 identified an issue where errors raised in a Spark job are not propagated to Python. This happens because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for those issues was to catch a SparkException if the job failed and send the exception through the PySpark serializer.
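
      As a minimal sketch of the failure mode (plain Python with hypothetical names, not the actual Spark code): an exception raised in a background thread is not re-raised in the thread that joins it, so the failure is silently lost unless it is explicitly caught and forwarded over the socket.

          import threading

          def serve_partitions():
              # Stands in for the serving thread that runs the Spark job after
              # the socket connection info has already been returned to Python.
              raise RuntimeError("job failed while serving results")

          server = threading.Thread(target=serve_partitions)
          server.start()
          server.join()  # returns normally; the RuntimeError is not re-raised
          # Execution continues here with no error beyond a traceback printed
          # to stderr by the default thread excepthook.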

      A better fix would be to allow Python to wait on the serving thread's future and join the thread. That way, if the serving thread throws an exception, it is propagated to Python on the call to awaitResult.
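
      A rough sketch of that pattern in plain Python (serve_partitions and the executor are hypothetical stand-ins; on the JVM side the wait corresponds to awaitResult on a Scala future): wrapping the serving work in a future and calling result() after the data has been consumed re-raises any serving-side exception in the caller.

          from concurrent.futures import ThreadPoolExecutor

          def serve_partitions():
              raise RuntimeError("job failed while serving results")

          try:
              with ThreadPoolExecutor(max_workers=1) as pool:
                  serving_future = pool.submit(serve_partitions)
                  # ... the client would consume rows from the socket here ...
                  serving_future.result()  # like awaitResult: re-raises the error
          except RuntimeError as e:
              print(f"error propagated to the caller: {e}")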


          People

            Assignee: Bryan Cutler (bryanc)
            Reporter: Bryan Cutler (bryanc)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
