Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
Description
Dynamic allocation and the external shuffle service do not work together on Mesos for applications that run longer than spark.network.timeout.
After two minutes (the default value of spark.network.timeout), I see a lot of FileNotFoundExceptions and Spark jobs just fail.
16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 10.0.1.208): java.io.FileNotFoundException: /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0 (No such file or directory)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
	at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
	...
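For reference, a minimal configuration sketch of the combination that triggers the problem (the application name and Mesos master URL are placeholders, not from the report). Any job whose shuffle phase outlives spark.network.timeout with these settings runs into the exception above:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: dynamic allocation plus the external shuffle service
// on Mesos, with the default network timeout.
val conf = new SparkConf()
  .setMaster("mesos://zk://master:2181/mesos")   // placeholder Mesos master URL
  .setAppName("long-running-shuffle-job")        // placeholder app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  // Default is 120s; shuffle files disappear once this idle timeout is reached.
  .set("spark.network.timeout", "120s")

val sc = new SparkContext(conf)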
Analysis
The Mesos external shuffle service needs a way to know when it is safe to delete the shuffle files of a given application. The current solution (which seemed to work fine while the RPC transport was based on Akka) was to open a TCP connection between the driver and each external shuffle service. Once the driver went down (gracefully or by crashing), the shuffle service would eventually get a notification from the network layer and delete the corresponding files.
This solution stopped working because it relies on a long-lived idle connection, and the new Netty-based RPC layer closes idle connections after spark.network.timeout.
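A minimal sketch of the cleanup-on-disconnect scheme described above (this is an illustration, not the actual MesosExternalShuffleService code): the service remembers which application a driver connection belongs to and deletes that application's shuffle files when the connection goes inactive. Because the Netty RPC layer also closes idle connections after spark.network.timeout, the same callback fires while the application is still running, which is what removes the files prematurely.

import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}
import scala.collection.concurrent.TrieMap

// Hypothetical handler illustrating connection-based liveness tracking.
class DriverConnectionHandler(cleanupAppFiles: String => Unit)
    extends ChannelInboundHandlerAdapter {

  // appId recorded when the driver connects (simplified; the real service
  // learns the appId from a registration message sent by the driver).
  private val connectedApps = TrieMap.empty[ChannelHandlerContext, String]

  def registerDriver(ctx: ChannelHandlerContext, appId: String): Unit =
    connectedApps.put(ctx, appId)

  override def channelInactive(ctx: ChannelHandlerContext): Unit = {
    // Fires when the driver's TCP connection closes, whether the driver
    // actually exited or the idle-connection timeout kicked in.
    connectedApps.remove(ctx).foreach(cleanupAppFiles)
    super.channelInactive(ctx)
  }
}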
Issue Links
- duplicates: SPARK-12583 spark shuffle fails with mesos after 2mins (Resolved)