Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-7061

Case Classes Cannot be Repartitioned/Shuffled in Spark REPL

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Duplicate
    • 1.2.1
    • None
    • Spark Shell
    • None
    • Single Node Stand Alone Spark Shell

    Description

      Running the following code in the spark shell against a stand alone master.

      case class CustomerID( id:Int)
      sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1)
      

      Gives the following exception

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at java.lang.Class.forName0(Native Method)
      	at java.lang.Class.forName(Class.java:274)
      	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
      	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
      	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
      	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
      	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
      	at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
      	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
      	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:56)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      I believe this is related to the shuffle code since the following other examples also give this exception.

      val idsOfInterest = sc.parallelize(1 to 1000).map(CustomerID(_)).groupBy(_.id).take(1)
      val idsOfInterest = sc.parallelize(1 to 1000).map( x => (CustomerID(_),x)).groupByKey().take(1)
      val idsOfInterest = sc.parallelize(1 to 1000).map( x => (CustomerID(_),x)).reduceByKey((x,y) => x+y).take(1)
      

      But these functions do not

      sc.parallelize(1 to 1000).map(CustomerID(_)).reduce( (x,y) => CustomerID(x.id+y.id) )
      sc.parallelize(1 to 1000).map(CustomerID(_)).map( x=> CustomerID(x.id+5) ).take(1)
      

      All of these examples work in application code and when the shell is run in Local mode.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rspitzer Russell Spitzer
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: