Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version: 2.3.0
- Fix Version: None
- Environment: OS: CentOS 7.3; Cluster: Hortonworks HDP 2.6.5 with Spark 2.3.0
Description
After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a massive performance hit because executors became unable to fetch remote cache blocks from each other. The scenario is:
1. An executor opens a connection and sends a ChunkFetchRequest message to another executor.
2. The request arrives at the target executor, which sends back a ChunkFetchSuccess response.
3. The ChunkFetchSuccess message never arrives at the originating executor.
4. The originating executor kills the connection after 120 s of idleness. At the same time, the target executor reports that it failed to send the ChunkFetchSuccess because the pipe was closed.
This process repeats three times, delaying our jobs by roughly six minutes (3 × 120 s of idle timeout plus retries); the originating executor then gives up on fetching, recomputes the block itself, and the job continues.
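For reference, the timing above matches Spark's defaults: `spark.network.timeout` defaults to 120 s and `spark.shuffle.io.maxRetries` to 3, which is also used by the block transfer service for cached blocks. As a sketch of a workaround (not a fix for the root cause), these can be tuned at submission time; the concrete values below are illustrative assumptions, not recommendations:

```shell
# Workaround sketch only: gives stalled ChunkFetch connections more time
# before they are torn down, and spaces the retries out further.
# The 600s / 15s values here are hypothetical examples, not tested settings.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.shuffle.io.retryWait=15s \
  --conf spark.shuffle.io.maxRetries=3 \
  ...
```

This only trades a shorter stall for a longer wait per attempt; it does not explain why the ChunkFetchSuccess response is lost in the first place.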