[SPARK-13159] External shuffle service broken w/ Mesos


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Mesos
    • Labels: None

    Description

      Dynamic allocation and the external shuffle service do not work together on Mesos for applications that run longer than spark.network.timeout.

      After two minutes (the default value of spark.network.timeout), I see many FileNotFoundExceptions and Spark jobs simply fail:

      16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 10.0.1.208): java.io.FileNotFoundException: /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0 (No such file or directory)
      	at java.io.FileOutputStream.open(Native Method)
      	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
      	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
      	at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
      	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
      	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
      ...
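
      For context, this is roughly the configuration under which the failure appears; a minimal sketch in Scala, where the setting keys are real Spark options and everything else is assumed:

        import org.apache.spark.SparkConf

        // Assumed reproduction setup: dynamic allocation plus the external
        // shuffle service. The failures begin once the application outlives
        // spark.network.timeout (120s by default).
        val conf = new SparkConf()
          .set("spark.dynamicAllocation.enabled", "true")
          .set("spark.shuffle.service.enabled", "true")
          .set("spark.network.timeout", "120s") // default; failures appear after this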
      

      Analysis

      The Mesos external shuffle service needs a way to know when it is safe to delete the shuffle files for a given application. The original solution (which worked while the RPC transport was based on Akka) was to open a TCP connection between the driver and each external shuffle service. Once the driver went down, gracefully or by crashing, the shuffle service would eventually get a notification from the network layer and delete the corresponding files.

      This solution stopped working because it relies on a long-lived idle connection, and the new Netty-based RPC layer closes idle connections after spark.network.timeout, as sketched below.
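
      To illustrate the broken mechanism (a hypothetical sketch, not the actual shuffle-service code): a Netty handler on the shuffle-service side that treats connection loss as "application finished" cannot tell a dead driver apart from an idle-timeout disconnect.

        import java.nio.file.{Files, Path}
        import java.util.Comparator
        import scala.collection.concurrent.TrieMap

        import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}

        // Hypothetical sketch of connection-based cleanup: one driver
        // connection is registered per application, and the application's
        // shuffle directory is deleted when that connection goes away.
        class DriverConnectionHandler(appDirs: TrieMap[String, Path], appId: String)
            extends ChannelInboundHandlerAdapter {

          // channelInactive fires when the driver exits or crashes -- but it
          // also fires when an idle timeout closes the connection, which is
          // exactly what the Netty RPC layer does after spark.network.timeout.
          override def channelInactive(ctx: ChannelHandlerContext): Unit = {
            appDirs.remove(appId).foreach { dir =>
              // Delete the application's shuffle files, deepest paths first.
              Files.walk(dir)
                .sorted(Comparator.reverseOrder[Path]())
                .forEach(p => Files.deleteIfExists(p))
            }
            super.channelInactive(ctx)
          }
        }

      Because the same callback runs in both cases, a timeout-induced close looks identical to a driver failure, the shuffle files of a still-running application get deleted, and subsequent tasks fail with the FileNotFoundExceptions shown above.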


            People

              Assignee: Unassigned
              Reporter: Dragos Dascalita Haut (dragos)
              Votes: 0
              Watchers: 1
