Spark / SPARK-10722

Uncaught exception: RDDBlockId not found in driver-heartbeater

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.0
    • Fix Version/s: 1.6.2
    • Component/s: Block Manager, Spark Core
    • Labels: None

    Description

      Some operations involving cached RDDs generate an uncaught exception in driver-heartbeater. If the .cache() call is removed, processing completes without the exception. However, not all RDDs trigger the problem, i.e., some .cache() operations are fine.

      I can see the problem with 1.4.1 and 1.5.0, but I have not been able to create a reproducible test case. The same exception is reported on Stack Overflow for v1.3.1, but there the behavior is related to large broadcast variables.
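
      For illustration only, the general shape of the code involved is roughly the following (app name, input path and transformations are placeholders; this by itself is not a reliable reproduction):

      import org.apache.spark.{SparkConf, SparkContext}

      object CacheRepro {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("cache-repro").setMaster("local[*]"))
          val rdd = sc.textFile("/path/to/large/input.txt") // placeholder path
            .map(_.split(","))
            .cache() // removing this .cache() avoids the exception
          println(rdd.count())
          sc.stop()
        }
      }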

      The full stack trace is:

      15/09/20 22:10:08 ERROR Utils: Uncaught exception in thread driver-heartbeater
      java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
      Caused by: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
        ... 33 more
      

          Activity

            michaelmalak Michael Malak added a comment -

            I have seen this in a small Hello World-type program, compiled and run from sbt, that reads a large text file and calls .cache(). But if I instead do sbt package and then spark-submit (rather than just sbt run), it works. That suggests there may be some dependency that is omitted from the published spark-core artifact but included in spark-assembly.

            This link suggests slf4j-simple.jar, but adding that to my build.sbt didn't help:
            https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-Exception-in-thread-quot-main-quot-java-lang/td-p/19544

            Googling, it seems the problem is more commonly encountered while running unit tests during the build of Spark itself.
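
            For reference, the workaround described above amounts to something like the following (the main class, Scala version and jar path are placeholders):

            sbt package
            spark-submit --class example.Main --master local[*] target/scala-2.10/example_2.10-0.1.jar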

            codepitbull Jochen Mader added a comment -

            Just stumbled on this issue:
            We ran into the same issue on Spark 1.6.0.
            It instantly went away after we switched to the Kryo serializer (which you should be using anyway).
            Looks like a bug in the JavaSerializer to me.

            The fix for us:

            new SparkConf().setAppName("Test Application").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
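
            For reference, the same setting can also be supplied outside the application code, either in spark-defaults.conf or at submit time with spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer.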
            

            pvcnt Vincent Primault added a comment -

            I just ran into the same issue (although it was the first time), but I am already using the KryoSerializer... I am running in local mode and am indeed caching some RDDs.

            codepitbull Jochen Mader added a comment -

            Just out of curiosity: Are you running 1.6 or a different version?


            pvcnt Vincent Primault added a comment -

            Yes, I forgot to mention, I am running Spark 1.6.0.


            pvcnt Vincent Primault added a comment -

            Moreover, the problematic part is the following one:
            16/02/18 18:10:04 ERROR o.a.s.s.TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 134974 ms

            I believe this is related, and it causes my application to shut down...

            simonscottuk Simon Scott added a comment -

            We too have experienced this exact exception and the resulting "Lost executor" error as described by Vincent Primault.

            We are using Spark 1.5.1 and the KryoSerializer.

            So the good news is that I believe I have identified a probable cause of the exception. I have rebuilt the spark-core jar with a fix and the issue appears to be resolved. I say "appears" because I need guidance on how to build a reproducible test case that provokes the issue and demonstrates the success of any fix. Suffice it to say that our nightly integration test, which was failing due to this issue, has now run successfully for several days. So I thought it was time to share my findings.

            Examining the exception stack trace leads us to the "Executor.reportHeartBeat" method. This method is run regularly by a ScheduledThreadPoolExecutor. Given the essentially random occurrences of this exception, it seems reasonable to assume that the exception happens when the particular pool thread running reportHeartBeat is not configured correctly. Again looking at the stack trace, the "deserialize" at line 440 of Executor.scala is failing to load the RDDBlockId class - so the failing thread is not configured with the correct class loader?

            So the fix I have applied is to change the Utils.deserialize call to call instead the Utils.deserialize override that takes a second argument which is the class loader to use. Helpfully Utils also provides "getContextOrSparkClassLoader" which seems to have a good enough value to resolve the issue.
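
            A sketch of the change described above (the surrounding code is paraphrased from Executor.scala and may not match the exact source; see the pull request for the real diff):

            // before: the single-argument Utils.deserialize relies on the calling
            // thread's context class loader
            //   val copiedMetrics = Utils.deserialize[TaskMetrics](Utils.serialize(metrics))
            // after: pass an explicit class loader so the heartbeater thread can
            // resolve Spark classes such as RDDBlockId
            val copiedMetrics = Utils.deserialize[TaskMetrics](
              Utils.serialize(metrics), Utils.getContextOrSparkClassLoader)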

            So I hope that helps. I would like to put forward a patch with my fix; the only thing holding me back is the lack of a reproducible test case. As I said, any guidance on how to generate one would be warmly received.

            simonscottuk Simon Scott added a comment - edited

            I should perhaps add that the exception in the heartbeat reporter causes it not to be scheduled again. So heartbeats are no longer sent by the executor, which causes the driver to believe that the executor is lost.
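
            A standalone sketch (not Spark code) of why one uncaught exception is enough to silence a periodically scheduled task: ScheduledExecutorService.scheduleAtFixedRate suppresses all subsequent executions once a run throws.

            import java.util.concurrent.{Executors, TimeUnit}

            object HeartbeatSuppressionDemo {
              def main(args: Array[String]): Unit = {
                val pool = Executors.newSingleThreadScheduledExecutor()
                var runs = 0
                pool.scheduleAtFixedRate(new Runnable {
                  def run(): Unit = {
                    runs += 1
                    println(s"heartbeat $runs")
                    if (runs == 2) sys.error("simulated deserialization failure")
                  }
                }, 0, 200, TimeUnit.MILLISECONDS)
                Thread.sleep(1500) // only two heartbeats are printed; the task is never run again
                pool.shutdownNow()
              }
            }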

            apachespark Apache Spark added a comment -

            User 'simonjscott' has created a pull request for this issue:
            https://github.com/apache/spark/pull/13222

            srowen Sean R. Owen added a comment -

            Issue resolved by pull request 13222
            https://github.com/apache/spark/pull/13222


            People

              Assignee: simonscottuk Simon Scott
              Reporter: simeons Simeon Simeonov
              Votes: 4
              Watchers: 11
