Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.3, 3.1.2, 3.2.0
-
None
Description
Python UDFs in Spark SQL are run in a separate Python process. The Python process is fed input by a dedicated thread (`BasePythonRunner.WriterThread`). This writer thread drives the child plan by pulling rows from its output iterator and serializing them across a socket.
When the child exec node is the off-heap vectorized Parquet reader, these rows are backed by off-heap memory. The child node uses a task completion listener to free the off-heap memory at the end of the task, which invalidates the output iterator and any rows it has produced. Since task completion listeners are registered bottom-up and executed in reverse order of registration, this is safe as long as an exec node never accesses its input after its task completion listener has executed.
The BasePythonRunner task completion listener violates this assumption. It interrupts the writer thread, but does not wait for it to exit. This causes a race condition that can lead to an executor crash:
1. The Python writer thread is processing a row backed by off-heap memory.
2. The task finishes, for example because it has reached a row limit.
3. The BasePythonRunner task completion listener sets the interrupt status of the writer thread, but the writer thread does not check it immediately.
4. The child plan's task completion listener frees its off-heap memory, invalidating the row that the Python writer thread is processing.
5. The Python writer thread attempts to access the invalidated row. The use-after-free triggers a segfault that crashes the executor.
https://issues.apache.org/jira/browse/SPARK-33277 describes the same issue, but the fix was incomplete. It did not address the situation where the Python writer thread accesses a freed row.