[SPARK-13328] Possible poor read performance for broadcast variables with dynamic resource allocation - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2
Fix Version/s: 2.0.0
Component/s: Spark Core
Labels: None
Target Version/s: 2.0.0

Description

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors caused job failures. SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and entries in this list can become stale because of dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. With the default settings of 3 max retries and 5s between retries (that is, 15s per location), we have observed that reading a broadcast variable can take as long as ~17m. The log below shows the failed 70th block fetch attempt, where each attempt takes 15s:

...
16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) (failed attempt 70)
...
16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms
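
To make the arithmetic concrete, here is a minimal Python sketch of the delay described above. It is not Spark code: the function name `broadcast_read_delay_s`, its parameters, and the executor names are hypothetical, and it assumes every stale location costs the full 3 retries x 5s before the next location is tried.

```python
def broadcast_read_delay_s(locations, live, max_retries=3, retry_wait_s=5):
    """Walk a location list that was fetched once and never refreshed.

    Returns (failed_attempts, seconds_lost_on_stale_locations, chosen_location).
    Assumption: each stale location costs max_retries * retry_wait_s seconds.
    """
    failed_attempts = 0
    waited_s = 0
    for loc in locations:
        if loc in live:
            # First reachable location: the block is read from here.
            return failed_attempts, waited_s, loc
        # Stale entry: all retries are exhausted before moving on.
        failed_attempts += 1
        waited_s += max_retries * retry_wait_s
    # Every listed location was stale.
    return failed_attempts, waited_s, None


# Hypothetical list: 70 stale executors ahead of one live executor.
locations = [f"stale-{i}" for i in range(70)] + ["live-0"]
failed, waited_s, chosen = broadcast_read_delay_s(locations, live={"live-0"})
print(failed, waited_s, chosen)  # 70 1050 live-0
```

Under these assumptions, 70 failed attempts cost 1050s (17.5 minutes), which is in line with the 1051049 ms reported in the log.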

Issue Links

links to

[Github] Pull Request #11241 (nezihyigitbasi)

People

Assignee: Nezih Yigitbasi

Reporter: Nezih Yigitbasi

Votes: 0

Watchers: 4

Dates

Created: 15/Feb/16 23:24

Updated: 11/Mar/16 19:39

Resolved: 11/Mar/16 19:39