[SPARK-13525] SparkR: java.net.SocketTimeoutException: Accept timed out when running any DataFrame function


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SparkR

    Description

      I am following the code steps from this example:
      https://spark.apache.org/docs/1.6.0/sparkr.html

      There are multiple issues:
      1. The `head`, `summary`, and `filter` methods are not overridden by Spark, so I have to call them through the `SparkR::` namespace (see the sketch after the session transcript below).
      2. When I try to execute the following, I get errors:

      $> $R_HOME/bin/R
      
      R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
      Copyright (C) 2015 The R Foundation for Statistical Computing
      Platform: x86_64-pc-linux-gnu (64-bit)
      
      R is free software and comes with ABSOLUTELY NO WARRANTY.
      You are welcome to redistribute it under certain conditions.
      Type 'license()' or 'licence()' for distribution details.
      
        Natural language support but running in an English locale
      
      R is a collaborative project with many contributors.
      Type 'contributors()' for more information and
      'citation()' on how to cite R or R packages in publications.
      
      Type 'demo()' for some demos, 'help()' for on-line help, or
      'help.start()' for an HTML browser interface to help.
      Type 'q()' to quit R.
      
      
      Welcome at Fri Feb 26 16:19:35 2016 
      
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:base’:
      
          colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
          summary, transform
      
      Launching java with spark-submit command /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
      > df <- createDataFrame(sqlContext, iris)
      Warning messages:
      1: In FUN(X[[i]], ...) :
        Use Sepal_Length instead of Sepal.Length  as column name
      2: In FUN(X[[i]], ...) :
        Use Sepal_Width instead of Sepal.Width  as column name
      3: In FUN(X[[i]], ...) :
        Use Petal_Length instead of Petal.Length  as column name
      4: In FUN(X[[i]], ...) :
        Use Petal_Width instead of Petal.Width  as column name
      > training <- filter(df, df$Species != "setosa")
      Error in filter(df, df$Species != "setosa") : 
        no method for coercing this S4 class to a vector
      > training <- SparkR::filter(df, df$Species != "setosa")
      > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
      16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      java.net.SocketTimeoutException: Accept timed out
              at java.net.PlainSocketImpl.socketAccept(Native Method)
              at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
              at java.net.ServerSocket.implAccept(ServerSocket.java:530)
              at java.net.ServerSocket.accept(ServerSocket.java:498)
              at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
              at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
              at org.apache.spark.scheduler.Task.run(Task.scala:81)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      16/02/26 16:26:46 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
      16/02/26 16:26:46 ERROR RBackendHandler: fitRModelFormula on org.apache.spark.ml.api.r.SparkRWrappers failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
        org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.net.SocketTimeoutException: Accept timed out
              at java.net.PlainSocketImpl.socketAccept(Native Method)
              at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
              at java.net.ServerSocket.implAccept(ServerSocket.java:530)
              at java.net.ServerSocket.accept(ServerSocket.java:498)
              at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
              at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      > 
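
      For reference, here is the workaround I use for point 1 above, as a minimal sketch; it assumes the `sqlContext` created in my `.Rprofile` (shown further below), and renaming the `iris` columns up front also avoids the `Sepal.Length` warnings:

      localIris <- iris
      # Dots are not valid in SparkR column names (hence the warnings above);
      # rename them to underscores before converting.
      names(localIris) <- gsub("\\.", "_", names(localIris))

      df <- createDataFrame(sqlContext, localIris)

      # filter() appears to be shadowed here (the S4 coercion error looks like
      # stats::filter), so SparkR's version needs an explicit namespace prefix:
      training <- SparkR::filter(df, df$Species != "setosa")
      SparkR::head(training)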
      

      Even when I try to run the `head` command on the DataFrame, I get a similar error:

      > SparkR::head(df)
      16/02/26 16:32:05 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 2)
      java.net.SocketTimeoutException: Accept timed out
              at java.net.PlainSocketImpl.socketAccept(Native Method)
              at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
              at java.net.ServerSocket.implAccept(ServerSocket.java:530)
              at java.net.ServerSocket.accept(ServerSocket.java:498)
              at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
              at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
              at org.apache.spark.scheduler.Task.run(Task.scala:81)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      16/02/26 16:32:05 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
      16/02/26 16:32:05 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
        org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 2, localhost): java.net.SocketTimeoutException: Accept timed out
              at java.net.PlainSocketImpl.socketAccept(Native Method)
              at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
              at java.net.ServerSocket.implAccept(ServerSocket.java:530)
              at java.net.ServerSocket.accept(ServerSocket.java:498)
              at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
              at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      

      I have a `.Rprofile` file in my directory that looks like the following:

      .Rprofile
      # Sample Rprofile.site file 
      
      # Things you might want to change
      .First <- function(){
        cat("\nWelcome at", date(), "\n") 
        SPARK_HOME <- "/content/user/SOFTWARE/spark"
        .libPaths(c(file.path(SPARK_HOME, "R", "lib"), .libPaths()))
        library(SparkR)
        sc <<- sparkR.init(master="local[20]", appName="Model SparkR", sparkHome=SPARK_HOME,
          sparkEnvir=list(spark.local.dir="./tmp",
          spark.executor.memory="50g",
          spark.driver.maxResultSize="50g",
          spark.driver.memory="50g"))
        sqlContext <<- sparkRSQL.init(sc)
      }
      
      .Last <- function(){ 
        cat("\nGoodbye at ", date(), "\n")
      }
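
      To rule out the profile itself, the same context can be built by hand in a fresh session started with `R --no-init-file`; this is just a sketch reusing the paths and settings from my `.First` above:

      SPARK_HOME <- "/content/user/SOFTWARE/spark"
      .libPaths(c(file.path(SPARK_HOME, "R", "lib"), .libPaths()))
      library(SparkR)

      # Same settings as in .First, created interactively this time.
      sc <- sparkR.init(master = "local[20]", appName = "Model SparkR",
                        sparkHome = SPARK_HOME,
                        sparkEnvir = list(spark.local.dir = "./tmp",
                                          spark.executor.memory = "50g",
                                          spark.driver.maxResultSize = "50g",
                                          spark.driver.memory = "50g"))
      sqlContext <- sparkRSQL.init(sc)

      df <- createDataFrame(sqlContext, iris)
      SparkR::head(df)  # the same call that fails above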
      
      

      I am using the master branch of Spark as of the following commit:

      commit 35316cb0b744bef9bcb390411ddc321167f953be
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      Date:   Thu Feb 25 13:29:10 2016 -0800
      


People

    Assignee: Unassigned
    Reporter: Shubhanshu Mishra (shubhanshumishra@gmail.com)
    Shepherd: Shivaram Venkataraman
    Votes: 4
    Watchers: 12
