[SPARK-16299] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2
    • Fix Version/s: 2.0.0
    • Component/s: SparkR
    • Labels: None

    Description

      Running the SparkR unit tests randomly fails with the following error:

      Failed -------------------------------------------------------------------------
      1. Error: pipeRDD() on RDDs (@test_rdd.R#428) ----------------------------------
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 (TID 1493, localhost): org.apache.spark.SparkException: R computation failed with
      [1] 1
      [1] 1
      [1] 2
      [1] 2
      [1] 3
      [1] 3
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      ignoring SIGPIPE signal
      Calls: source ... <Anonymous> -> lapply -> lapply -> FUN -> writeRaw -> writeBin
      Execution halted
      cannot open the connection
      Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
      In addition: Warning message:
      In file(con, "w") :
      cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
      Execution halted
      at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
      at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
      at org.apache.spark.scheduler.Task.run(Task.scala:85)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      This is related to the daemon R worker mode. By default, SparkR launches one R daemon process per executor and forks R workers from the daemon when necessary.

      The problem with forking R workers is that all forked R processes share a single session temporary directory, as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
      When any forked R worker exits, whether normally or because of an error, R's cleanup procedure deletes that temporary directory. This affects the still-running forked R workers, because any temporary files they created under the directory are removed along with it. It also affects every R worker subsequently forked from the daemon: calls to tempdir() or tempfile() will point into the already-deleted session temporary directory, so creating temporary files there fails.
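
      The sharing is easy to demonstrate. Below is a minimal sketch (illustrative only, not from the issue; Unix only) that forks a child with parallel::mcparallel() and confirms that parent and child see the same session temporary directory:

            library(parallel)

            parent_tmp <- tempdir()

            # mcparallel() forks the current R process, similar to how daemon.R
            # forks workers; the child inherits the parent's session temp directory.
            job <- mcparallel(tempdir())
            child_tmp <- mccollect(job)[[1]]

            identical(parent_tmp, child_tmp)  # TRUE: one shared directory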

      So for the daemon mode to work, this problem must be circumvented. In the current daemon.R, R workers exit directly, skipping R's cleanup procedure, so that the shared temporary directory is not deleted:

            source(script)
            # Set SIGUSR1 so that child can exit
            tools::pskill(Sys.getpid(), tools::SIGUSR1)
            parallel:::mcexit(0L)
      

      However, daemon.R has a bug: when an execution error occurs in an R worker, R's error handling eventually falls through to the cleanup procedure, which deletes the shared temporary directory. The source(script) call should therefore be wrapped in try() to catch any error from the worker, so that the worker still exits directly:

            try(source(script))
            # Set SIGUSR1 so that child can exit
            tools::pskill(Sys.getpid(), tools::SIGUSR1)
            parallel:::mcexit(0L)
      
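      The effect of try() can be seen in isolation with a minimal sketch (the stop() call below simulates an arbitrary error in the sourced worker script):

            # try() contains the error instead of letting it propagate into R's
            # error handling, so the statements after it still run -- in daemon.R
            # those are the direct-exit calls that bypass the cleanup procedure.
            result <- try(stop("simulated worker error"), silent = TRUE)
            inherits(result, "try-error")  # TRUE: the error was captured
            cat("direct-exit path is still reached\n")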

            People

              Assignee: Sun Rui
              Reporter: Sun Rui
