SPARK-24346

Executors are unable to fetch remote cache blocks

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Environment: OS: CentOS 7.3
      Cluster: Hortonworks HDP 2.6.5 with Spark 2.3.0

    Description

      After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a massive performance hit because executors became unable to fetch remote cache blocks from each other. The scenario is:

      1. An executor creates a connection and sends a ChunkFetchRequest message to another executor.
      2. This request arrives at the target executor, which sends back a ChunkFetchSuccess response.
      3. The ChunkFetchSuccess message never arrives at the originating executor.
      4. The originating executor kills the connection between the two executors after 120s of idleness. At the same time, the target executor reports that it failed to send the ChunkFetchSuccess because the pipe is closed.

      This process repeats 3 times, delaying our jobs by 6 minutes (3 × 120s). The originating executor then stops fetching and computes the block itself, after which the job can continue.
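
      The delay described above can be sketched as a small simulation. This is only an illustrative model of the reported behaviour, not Spark code: the function, class, and parameter names are hypothetical, and the values of 3 attempts and a 120s idle timeout are taken from this report rather than read from any configuration.

```python
from dataclasses import dataclass


@dataclass
class FetchOutcome:
    attempts: int      # fetch attempts made before returning
    seconds_lost: int  # total time spent waiting on idle connections
    recomputed: bool   # True if the block had to be recomputed locally


def simulate_remote_fetch(response_arrives, max_attempts=3, idle_timeout_s=120):
    """Model the reported loop: request, wait for a response, time out, retry.

    response_arrives(attempt) is a hypothetical callback that returns True
    when the ChunkFetchSuccess for that attempt reaches the requester.
    """
    seconds_lost = 0
    for attempt in range(1, max_attempts + 1):
        if response_arrives(attempt):
            return FetchOutcome(attempt, seconds_lost, recomputed=False)
        # No response: the requester closes the connection after the idle timeout.
        seconds_lost += idle_timeout_s
    # Every attempt timed out: stop fetching and recompute the block locally.
    return FetchOutcome(max_attempts, seconds_lost, recomputed=True)


# The failure mode in this report: the response never arrives.
outcome = simulate_remote_fetch(lambda attempt: False)
print(outcome.seconds_lost / 60)  # 6.0
```

      With a response that never arrives, the model loses 3 × 120s = 360s before falling back to local recomputation, which matches the 6 minute delay observed per affected block.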

          People

            Assignee: Unassigned
            Reporter: Truong Duc Kien (kien_truong)
            Votes: 2
            Watchers: 3
