  1. Spark
  2. SPARK-31373

Cluster tried to fetch blocks from blacklisted node of previous stage

    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.2
    • Fix Version/s: None
    • Component/s: Block Manager, Spark Core
    • Labels: None
    • Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances

      Description

      We enabled blacklisting in our Spark application, but recently we have seen a strange issue.

      Our code looks like this:
        rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()
      In the mapPartitions stage, some executors hit the exception "Can't connect to host xxxxxx: Connection reset by peer" and the tasks running on them failed, so all executors on that node were blacklisted, along with the node itself. These executors had completed some tasks before they were blacklisted.

      Then, in the next stage (groupByKey(...).map()), the application failed with a block fetch failure: an IndexOutOfBoundsException was thrown when a healthy executor tried to fetch a block from one of the blacklisted executors mentioned above.

      It happened multiple times.
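
The sequence of events described above can be modelled in a few lines of plain Python. This is only a toy sketch, not Spark code: `ToyScheduler`, its methods, and the executor names are hypothetical and exist purely for illustration. It assumes that blacklisting only prevents new tasks from being scheduled on an executor, while the outputs of map tasks that the executor already completed stay registered and are still fetched by the next stage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model (hypothetical, not Spark's API) of blacklisting vs. shuffle output locations."""

    executors: list
    blacklisted: set = field(default_factory=set)
    # Maps a completed map task id to the executor that holds its output.
    map_outputs: dict = field(default_factory=dict)

    def schedulable(self):
        # Blacklisting only filters where *new* tasks may run.
        return [e for e in self.executors if e not in self.blacklisted]

    def finish_map_task(self, task_id, executor):
        # A completed map task registers the location of its output.
        self.map_outputs[task_id] = executor

    def blacklist(self, executor):
        # Assumption: outputs already written by this executor stay registered.
        self.blacklisted.add(executor)

    def fetch_locations(self):
        # Executors the next stage has to contact to read map outputs.
        return set(self.map_outputs.values())


sched = ToyScheduler(executors=["exec-1", "exec-2", "exec-3"])
sched.finish_map_task(0, "exec-1")  # exec-1 completes a task before it starts failing
sched.finish_map_task(1, "exec-2")
sched.blacklist("exec-1")           # later tasks on exec-1 fail, so it is blacklisted
sched.finish_map_task(2, "exec-3")  # the failed task is retried on another executor

print(sched.schedulable())              # exec-1 is excluded from new tasks
print(sorted(sched.fetch_locations()))  # exec-1 is still a fetch source
```

Under these assumptions, a blacklisted executor can still appear among the fetch sources of the following stage, which matches the behaviour reported here. The sketch says nothing about why the fetch itself failed with an exception.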

            People

            • Assignee: Unassigned
            • Reporter: CrossLife Yuchen Feng
            • Votes: 0
            • Watchers: 2
