Description
A Python/Pandas UDF evaluated right after the off-heap vectorized reader can crash the executor.
E.g. (assuming a running SparkSession `spark` and a writable `path`, as in the original snippet):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Write a single-partition dataset, then enable the off-heap vectorized reader.
spark.range(0, 100000, 1, 1).write.parquet(path)
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

# head() ends the task after the first row, while the UDF evaluation
# thread may still be consuming the (now closed) parent iterator.
spark.read.parquet(path).select(fUdf('id')).head()
This is because the Python UDF evaluation consumes the parent iterator in a separate thread, and that thread can keep pulling data from the parent even after the task ends and the parent iterator is closed. If the parent iterator holds an off-heap column vector, its memory is freed on close, so the late read becomes a use-after-free that can trigger a segmentation fault and crash the executor.
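To make the lifecycle mismatch concrete, here is a minimal, self-contained sketch of the race (plain Python, no Spark internals; OffHeapVector and python_eval_thread are hypothetical stand-ins for the off-heap column vector and the writer thread feeding the Python worker):

import threading
import time

class OffHeapVector:
    """Stand-in for an off-heap column vector: valid until close() frees it."""
    def __init__(self, values):
        self._values = values
        self._closed = False

    def get(self, i):
        if self._closed:
            # In the JVM this would be a read of freed off-heap memory:
            # a segmentation fault, not a catchable error.
            raise RuntimeError("read after free: the executor would crash here")
        return self._values[i]

    def close(self):
        self._closed = True

def python_eval_thread(vector, n):
    # Mimics the separate thread that feeds rows to the Python worker;
    # it knows nothing about the task's lifecycle.
    for i in range(n):
        vector.get(i)
        time.sleep(0.001)

vec = OffHeapVector(list(range(1000)))
t = threading.Thread(target=python_eval_thread, args=(vec, 1000), daemon=True)
t.start()

# The task finishes early (head() needs only one row) and its completion
# callback closes the parent iterator, freeing the vector...
time.sleep(0.01)
vec.close()

# ...while the evaluation thread is still pulling rows: the use-after-free
# window described above.
t.join(timeout=1)

In this simulation the late read raises an exception; in the real executor the same read hits freed off-heap memory, which is why the failure shows up as a JVM crash rather than a Python-level error.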
Issue Links
- is related to: SPARK-39084 df.rdd.isEmpty() results in unexpected executor failure and JVM crash (Resolved)
- links to: (8 links)