Spark / SPARK-21563

Race condition when serializing TaskDescriptions and adding jars


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: Scheduler, Spark Core
    • Labels:
      None

      Description

      cc Robert Kruszewski

      I was seeing this exception while running some Spark jobs:

      16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error
      java.io.EOFException: null
          at java.io.DataInputStream.readFully(DataInputStream.java:197)
          at java.io.DataInputStream.readUTF(DataInputStream.java:609)
          at java.io.DataInputStream.readUTF(DataInputStream.java:564)
          at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
          at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
          at scala.collection.immutable.Range.foreach(Range.scala:160)
          at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
          at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
          at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
          at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
          at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
          at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:748)
      

      After some debugging, we determined that this is caused by a race condition in task serialization/deserialization (serde). cc Imran Rashid and Kay Ousterhout, who last touched that code in SPARK-19796.

      The race is between adding additional jars to the SparkContext and serializing the TaskDescription.

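      Consider this sequence of events:

      1. TaskSetManager creates a TaskDescription that holds a reference to the mutable jars list.
      2. Serialization of the TaskDescription begins and writes out the size of the jars list, N.
      3. On another thread, the application adds a jar to the SparkContext, so the list now holds N+1 entries.
      4. Serialization continues and writes out the entries of the jars list, all N+1 of them.

      The problem now is that the jars list is serialized as having N entries, but N+1 entries actually follow that count.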

      This causes task deserialization to fail in the executor, with the stacktrace above.
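
The mismatch can be reproduced deterministically with a small sketch. This is not Spark's TaskDescription code: the class `JarsSerdeSketch`, its methods, and the `afterJarCount` hook (which stands in for the other thread) are hypothetical names, and the layout is only assumed to be a count followed by entries, with a properties map written after the jars.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a count-prefixed layout; not Spark's actual code.
final class JarsSerdeSketch {

    // Writes the jars map and a properties map, each as a count followed by its entries.
    // `afterJarCount` runs between the jar count and the jar entries, standing in for
    // another thread that adds a jar at exactly that moment.
    static byte[] encode(Map<String, Long> jars, Map<String, String> props,
                         Runnable afterJarCount) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(jars.size());
        afterJarCount.run();
        for (Map.Entry<String, Long> e : jars.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return buffer.toByteArray();
    }

    // Reads back exactly as many entries as each count declares.
    static Map<String, Long> decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, Long> jars = new LinkedHashMap<>();
        int numJars = in.readInt();
        for (int i = 0; i < numJars; i++) {
            String name = in.readUTF();
            jars.put(name, in.readLong());
        }
        // After a racy encode, this int is really the start of the extra jar entry.
        int numProps = in.readInt();
        for (int i = 0; i < numProps; i++) {
            in.readUTF();
            in.readUTF();
        }
        return jars;
    }

    // Returns true if decoding the racy encoding fails with EOFException.
    static boolean racyEncodingFailsToDecode() throws IOException {
        Map<String, Long> jars = new LinkedHashMap<>();
        jars.put("a.jar", 1L);
        Map<String, String> props = new LinkedHashMap<>();
        props.put("k", "v");
        // The hook adds a second jar after the count (1) has already been written.
        byte[] data = encode(jars, props, () -> jars.put("b.jar", 2L));
        try {
            decode(data);
            return false;
        } catch (EOFException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("racy encoding fails to decode: " + racyEncodingFailsToDecode());
    }
}
```

In this sketch the reader consumes one jar entry, then interprets the leading bytes of the extra entry as the next count and runs off the end of the buffer inside `readUTF`, which is the same shape of failure as the stack trace above.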

      The same issue likely also exists for files, though I haven't observed it, and our application does not stress that codepath the way it does jar additions.

      One fix here is that TaskSetManager could make an immutable copy of the jars list that it passes into the TaskDescription constructor, so that list doesn't change mid-serialization.
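
A minimal sketch of that idea, under stated assumptions: `JarsSnapshotSketch`, `snapshot`, and the `afterCount` hook are hypothetical names, the live list is modeled as a `ConcurrentHashMap` so that copying it is safe, and the wire layout is again assumed to be a count followed by entries. It is an illustration of the approach, not the patch that was merged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of serializing from an immutable copy.
final class JarsSnapshotSketch {

    // Copies the live map so later additions cannot change what gets serialized.
    static Map<String, Long> snapshot(Map<String, Long> liveJars) {
        return Collections.unmodifiableMap(new HashMap<>(liveJars));
    }

    // Writes a count followed by the entries; `afterCount` simulates a concurrent
    // jar addition that lands between the two steps.
    static byte[] encode(Map<String, Long> jars, Runnable afterCount) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(jars.size());
        afterCount.run();
        for (Map.Entry<String, Long> e : jars.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
        out.flush();
        return buffer.toByteArray();
    }

    // The count the writer declared.
    static int declaredCount(byte[] data) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(data)).readInt();
    }

    // The number of entries actually present after the count.
    static int entriesPresent(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readInt();
        int n = 0;
        while (in.available() > 0) {
            in.readUTF();
            in.readLong();
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> live = new ConcurrentHashMap<>();
        live.put("a.jar", 1L);
        Map<String, Long> frozen = snapshot(live);
        // The addition goes to the live map only; the snapshot is unaffected.
        byte[] data = encode(frozen, () -> live.put("b.jar", 2L));
        System.out.println("declared=" + declaredCount(data)
                + " present=" + entriesPresent(data)
                + " live=" + live.size());
    }
}
```

The copy has to be taken before serialization starts. If the live collection were not thread-safe, the copy itself would also need to be taken under whatever lock guards additions.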

            People

            • Assignee: Andrew Ash (aash)
            • Reporter: Andrew Ash (aash)