Description
Running the SparkR unit tests randomly fails with the following error:
Failed -------------------------------------------------------------------------
1. Error: pipeRDD() on RDDs (@test_rdd.R#428) ----------------------------------
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 (TID 1493, localhost): org.apache.spark.SparkException: R computation failed with
[1] 1
[1] 1
[1] 2
[1] 2
[1] 3
[1] 3
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
ignoring SIGPIPE signal
Calls: source ... <Anonymous> -> lapply -> lapply -> FUN -> writeRaw -> writeBin
Execution halted
cannot open the connection
Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is related to the daemon R worker mode. By default, SparkR launches one R daemon process per executor and forks R workers from that daemon as needed.
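For context, a rough sketch of the fork-based daemon loop (simplified and hypothetical, not the actual daemon.R; the real daemon receives the worker script over a socket and forks one worker per task):

suppressPackageStartupMessages(library(parallel))

# Stand-in for running the worker script in the forked child.
runWorker <- function() {
  cat("forked worker pid:", Sys.getpid(), "\n")
}

p <- parallel:::mcfork()
if (inherits(p, "masterProcess")) {
  # We are in the forked child: run the worker, then exit immediately so the
  # child never returns into the daemon's loop.
  runWorker()
  parallel:::mcexit(0L)
}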
The problem with forking R workers is that all forked R processes share one session temporary directory, as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
When any forked R worker exits, either normally or because of an error, R's cleanup procedure deletes that temporary directory. This affects the still-running forked R workers, because any temporary files they created under that directory are removed along with it. It also affects every R worker forked from the daemon afterwards: calls to tempdir() or tempfile() will fail because the session temporary directory has already been deleted.
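The shared temporary directory is easy to observe; a minimal sketch, assuming a Unix-like system where the parallel package can fork:

library(parallel)

parentTmp <- tempdir()
# mcparallel() forks the current R session; the child inherits its state,
# including the per-session temporary directory.
job <- mcparallel(tempdir())
childTmp <- mccollect(job)[[1]]
identical(parentTmp, childTmp)  # TRUE: parent and forked child share one tempdir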
So for daemon mode to work, this problem has to be circumvented. In the current daemon.R, R workers exit directly, skipping R's cleanup procedure, so that the shared temporary directory is not deleted:
source(script)
# Set SIGUSR1 so that child can exit
tools::pskill(Sys.getpid(), tools::SIGUSR1)
parallel:::mcexit(0L)
However, there is a bug in daemon.R: when an execution error occurs in an R worker, R's error handling eventually runs the cleanup procedure anyway. So try() should be used in daemon.R to catch any error in the R worker, so that the worker still exits directly:
try(source(script))
# Set SIGUSR1 so that child can exit
tools::pskill(Sys.getpid(), tools::SIGUSR1)
parallel:::mcexit(0L)
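A minimal, self-contained illustration of why the try() matters, using a throwaway failing script (hypothetical, standing in for the actual SparkR worker script):

# Write a script that fails, simulating an error in the worker computation.
badScript <- tempfile(fileext = ".R")
writeLines('stop("simulated worker failure")', badScript)

runWorker <- function(wrapInTry) {
  if (wrapInTry) try(source(badScript)) else source(badScript)
  # In daemon.R, the lines after source() perform the direct exit; reaching
  # them means R's normal shutdown (and tempdir cleanup) is skipped.
  "reached direct-exit path"
}

tryCatch(runWorker(FALSE), error = function(e) "error escaped; cleanup would run")
runWorker(TRUE)  # try() swallows the error, so the direct-exit path is reached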
Issue Links
is duplicated by: SPARK-16300 - Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory (Resolved)