Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version: 2.3.0
- Fix Version: None
- Environment: OS: CentOS 7.3; Cluster: Hortonworks HDP 2.6.5 with Spark 2.3.0
Description
After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a massive performance hit because executors became unable to fetch remote cache blocks from each other. The scenario is:
1. An executor opens a connection and sends a ChunkFetchRequest message to another executor.
2. The request arrives at the target executor, which sends back a ChunkFetchSuccess response.
3. The ChunkFetchSuccess message never arrives at the originating executor.
4. The originating executor kills the connection after 120 s of idleness. At the same time, the target executor reports that it failed to send the ChunkFetchSuccess because the pipe was closed.
This process repeats three times, delaying our jobs by roughly six minutes (3 × 120 s of idle timeout plus retries); the originating executor then gives up on fetching, recomputes the block itself, and the job continues.
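For reference, the timing above matches Spark's defaults: `spark.network.timeout` defaults to 120 s and `spark.shuffle.io.maxRetries` to 3, which is also used by the block transfer service for cached blocks. As a sketch of a workaround (not a fix for the root cause), these can be tuned at submission time; the concrete values below are illustrative assumptions, not recommendations:

```shell
# Workaround sketch only: gives stalled ChunkFetch connections more time
# before they are torn down, and spaces the retries out further.
# The 600s / 15s values here are hypothetical examples, not tested settings.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.shuffle.io.retryWait=15s \
  --conf spark.shuffle.io.maxRetries=3 \
  ...
```

This only trades a shorter stall for a longer wait per attempt; it does not explain why the ChunkFetchSuccess response is lost in the first place.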