[SPARK-3015] Removing broadcast in quick successions causes Akka timeout - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.1.0
Component/s: Spark Core
Labels: None
Environment: Standalone EC2 Spark shell
Target Version/s: 1.1.0

Description

This issue was originally reported in SPARK-2916 in the context of MLlib, but we were able to reproduce it using a simple Spark shell command:

(1 to 10000).foreach { i => sc.parallelize(1 to 1000, 48).sum }

We still do not have a full understanding of the issue, but we have gleaned the following so far. When the driver runs a GC, it attempts to clean up, all at once, every broadcast block that has gone out of scope. This causes the driver to send many blocking RemoveBroadcast messages to the executors, which in turn send blocking UpdateBlockInfo messages back to the driver. Both of these calls block until they receive the expected responses. We suspect that the high frequency at which we send these blocking messages causes either dropped messages or an internal deadlock somewhere.
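To make the suspected deadlock scenario concrete, here is a minimal sketch in plain Scala that uses neither Spark nor Akka. It models only one hypothesized failure mode, not a diagnosed cause: a caller blocks on a reply whose handler must itself call back into the caller's thread pool. All names (`NestedBlockingSketch`, `removeBroadcastTimesOut`, the pool sizes and timeouts) are hypothetical and chosen for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future, TimeoutException}
import scala.concurrent.duration._

// Hypothetical model of nested blocking request/reply; not Spark's actual code.
object NestedBlockingSketch {

  // Returns true if the driver-side blocking "RemoveBroadcast" call times out.
  def removeBroadcastTimesOut(driverThreads: Int, timeout: FiniteDuration): Boolean = {
    val driverPool = Executors.newFixedThreadPool(driverThreads)
    val executorPool = Executors.newFixedThreadPool(1)
    val driverEc = ExecutionContext.fromExecutorService(driverPool)
    val executorEc = ExecutionContext.fromExecutorService(executorPool)
    try {
      // The driver's cleanup task occupies one driver thread while it waits.
      val driverCall = Future {
        val removeReply = Future {
          // The "executor" handles RemoveBroadcast, then reports back to the
          // driver and blocks until that report is acknowledged.
          val updateReply = Future { "ack" }(driverEc) // needs a free driver thread
          Await.result(updateReply, timeout * 3L)
          "removed"
        }(executorEc)
        // Blocking wait for the executor's reply.
        Await.result(removeReply, timeout)
      }(driverEc)

      try {
        Await.result(driverCall, timeout * 4L)
        false
      } catch {
        case _: TimeoutException => true
      }
    } finally {
      // Let already queued tasks drain, then release the threads.
      driverPool.shutdown()
      executorPool.shutdown()
    }
  }
}
```

With a single driver thread, `removeBroadcastTimesOut(1, 200.millis)` reports a timeout, because the only thread that could acknowledge the executor's report is the one blocked waiting for the executor. With two driver threads the same call completes normally. Whether this mirrors what actually happens inside the driver is exactly the open question described above.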

Unfortunately, whether the issue reproduces depends heavily on the environment. For instance, we have been able to reproduce it on a 6-node cluster in us-west-2, but not in us-west-1.