Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.5.2
Description
With Spark Connect + PySpark, we can stage files using `spark.addArtifacts`. When a Python UDF is executed, its working directory is set to a folder in which the corresponding artifacts are available.
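For context, a minimal sketch of that flow (the endpoint URL and the `lookup.txt` file name are placeholders, not from the failing jobs):

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Connect to a Spark Connect endpoint (URL is a placeholder).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Stage a local file as a session artifact.
spark.addArtifacts("lookup.txt", file=True)

@udf(returnType=BooleanType())
def artifact_visible(i: int) -> bool:
    # The UDF runs with its working directory set to the session's
    # artifact folder, so the staged file is reachable by relative path.
    return os.path.exists("lookup.txt")

spark.range(10).select(artifact_visible("id")).show()
```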
I have observed on large-scale jobs with long-running tasks (>45 mins) that Spark sometimes removes that working directory even though UDF tasks are still running. This can be seen by periodically calling `os.getcwd()` in the UDF, which then raises `FileNotFoundError`.
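A probe along these lines makes the failure visible; the one-minute interval and the total duration are illustrative, not the exact values from the failing jobs:

```python
import os
import time

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def probe_cwd(i: int) -> str:
    # Long-running task that checks the working directory once a
    # minute; the 60-minute cap is illustrative.
    for minute in range(60):
        try:
            os.getcwd()
        except FileNotFoundError:
            return f"cwd gone after ~{minute} min"
        time.sleep(60)
    return "cwd survived"

spark.range(8).select(probe_cwd("id")).collect()
```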
This seems to coincide with log records indicating `Session evicted: <uuid>` from `isolatedSessionCache`. There is a 30-minute timeout there that might be to blame.
I have not yet been able to write a simple program that reproduces this. I suspect it requires a conjunction of multiple events, such as a task being scheduled on an executor more than 30 minutes after the last task started. https://issues.apache.org/jira/browse/SPARK-44290 might be relevant.
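As an unconfirmed sketch of that suspected conjunction (a single executor core, ~35-minute task durations, and the helper name are all assumptions, not a verified reproduction):

```python
import os
import time

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def long_task(i: int) -> int:
    # Each task runs ~35 minutes. Assuming one executor core, the
    # second task starts more than 30 minutes after the first one
    # did, which is the timing window suspected above.
    time.sleep(35 * 60)
    os.getcwd()  # would raise FileNotFoundError if the dir is gone
    return i

spark.range(2).repartition(2).select(long_task(col("id"))).collect()
```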
cc gurwls223