Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
Description
Dynamic allocation and the external shuffle service do not work together on Mesos for applications that run longer than spark.network.timeout.
After two minutes (the default value of spark.network.timeout), I see a lot of FileNotFoundExceptions and Spark jobs just fail.
16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 10.0.1.208): java.io.FileNotFoundException: /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0 (No such file or directory)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
	at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
	...
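For reference, a minimal configuration sketch of the combination that triggers the problem (the application name and Mesos master URL are placeholders, not from the report). Any job whose shuffle phase outlives spark.network.timeout with these settings runs into the exception above:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: dynamic allocation plus the external shuffle service
// on Mesos, with the default network timeout.
val conf = new SparkConf()
  .setMaster("mesos://zk://master:2181/mesos")   // placeholder Mesos master URL
  .setAppName("long-running-shuffle-job")        // placeholder app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  // Default is 120s; shuffle files disappear once this idle timeout is reached.
  .set("spark.network.timeout", "120s")

val sc = new SparkContext(conf)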
Analysis
The Mesos external shuffle service needs a way to know when it is safe to delete the shuffle files of a given application. The current solution (which seemed to work fine while the RPC transport was based on Akka) was to open a TCP connection between the driver and each external shuffle service. Once the driver went down (gracefully or by crashing), the shuffle service would eventually get a notification from the network layer and delete the corresponding files.
This solution stopped working because it relies on a long-lived idle connection, and the new Netty-based RPC layer closes idle connections after spark.network.timeout.
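A minimal sketch of the cleanup-on-disconnect scheme described above (this is an illustration, not the actual MesosExternalShuffleService code): the service remembers which application a driver connection belongs to and deletes that application's shuffle files when the connection goes inactive. Because the Netty RPC layer also closes idle connections after spark.network.timeout, the same callback fires while the application is still running, which is what removes the files prematurely.

import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}
import scala.collection.concurrent.TrieMap

// Hypothetical handler illustrating connection-based liveness tracking.
class DriverConnectionHandler(cleanupAppFiles: String => Unit)
    extends ChannelInboundHandlerAdapter {

  // appId recorded when the driver connects (simplified; the real service
  // learns the appId from a registration message sent by the driver).
  private val connectedApps = TrieMap.empty[ChannelHandlerContext, String]

  def registerDriver(ctx: ChannelHandlerContext, appId: String): Unit =
    connectedApps.put(ctx, appId)

  override def channelInactive(ctx: ChannelHandlerContext): Unit = {
    // Fires when the driver's TCP connection closes, whether the driver
    // actually exited or the idle-connection timeout kicked in.
    connectedApps.remove(ctx).foreach(cleanupAppFiles)
    super.channelInactive(ctx)
  }
}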
Issue Links
- duplicates: SPARK-12583 spark shuffle fails with mesos after 2mins (Resolved)