[SPARK-16299] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2
    • Fix Version/s: 2.0.0
    • Component/s: SparkR
    • Labels: None

    Description

      Running the SparkR unit tests randomly fails with the following error:

      Failed -------------------------------------------------------------------------
      1. Error: pipeRDD() on RDDs (@test_rdd.R#428) ----------------------------------
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 (TID 1493, localhost): org.apache.spark.SparkException: R computation failed with
      [1] 1
      [1] 1
      [1] 2
      [1] 2
      [1] 3
      [1] 3
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      [1] 2
      ignoring SIGPIPE signal
      Calls: source ... <Anonymous> -> lapply -> lapply -> FUN -> writeRaw -> writeBin
      Execution halted
      cannot open the connection
      Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
      In addition: Warning message:
      In file(con, "w") :
      cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
      Execution halted
      at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
      at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
      at org.apache.spark.scheduler.Task.run(Task.scala:85)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      This is related to the daemon R worker mode. By default, SparkR launches one R daemon process per executor and forks R workers from the daemon when necessary.

      The problem with forking R workers is that all forked R processes share a single session temporary directory, as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
      When any forked R worker exits, whether normally or because of an error, R's cleanup procedure deletes that temporary directory. This affects the still-running forked R workers, because any temporary files they created under the directory are removed along with it. It also affects every R worker subsequently forked from the daemon: calls to tempdir() or tempfile() will point into the already-deleted session temporary directory, so creating temporary files there fails.
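
      The sharing is easy to demonstrate. Below is a minimal sketch (illustrative only, not from the issue; Unix only) that forks a child with parallel::mcparallel() and confirms that parent and child see the same session temporary directory:

            library(parallel)

            parent_tmp <- tempdir()

            # mcparallel() forks the current R process, similar to how daemon.R
            # forks workers; the child inherits the parent's session temp directory.
            job <- mcparallel(tempdir())
            child_tmp <- mccollect(job)[[1]]

            identical(parent_tmp, child_tmp)  # TRUE: one shared directory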

      So for the daemon mode to work, this problem must be circumvented. In the current daemon.R, R workers exit directly, skipping R's cleanup procedure, so that the shared temporary directory is not deleted:

            source(script)
            # Set SIGUSR1 so that child can exit
            tools::pskill(Sys.getpid(), tools::SIGUSR1)
            parallel:::mcexit(0L)
      

      However, daemon.R has a bug: when an execution error occurs in an R worker, R's error handling eventually falls through to the cleanup procedure, which deletes the shared temporary directory. The source(script) call should therefore be wrapped in try() to catch any error from the worker, so that the worker still exits directly:

            try(source(script))
            # Set SIGUSR1 so that child can exit
            tools::pskill(Sys.getpid(), tools::SIGUSR1)
            parallel:::mcexit(0L)
      
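      The effect of try() can be seen in isolation with a minimal sketch (the stop() call below simulates an arbitrary error in the sourced worker script):

            # try() contains the error instead of letting it propagate into R's
            # error handling, so the statements after it still run -- in daemon.R
            # those are the direct-exit calls that bypass the cleanup procedure.
            result <- try(stop("simulated worker error"), silent = TRUE)
            inherits(result, "try-error")  # TRUE: the error was captured
            cat("direct-exit path is still reached\n")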

            People

              Assignee: Sun Rui
              Reporter: Sun Rui
