Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-4592

"Worker registration failed: Duplicate worker ID" error during Master failover

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • 1.2.0
    • 1.2.0
    • Deploy, Spark Core
    • None

    Description

      When running Spark Standalone in high-availability mode, we sometimes see "Worker registration failed: Duplicate worker ID" errors which prevent workers from reconnecting to the new active master. I've attached full logs from a reproduction in my integration tests suite (which runs something similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a worker log during a failed run of the "rolling outage" test, which creates a multi-master cluster then repeatedly kills the active master, waits for workers to reconnect to a new active master, then kills that master, and so on.

      14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081
      14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
      14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
      14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
      14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077
      14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-0000/1 for spark-integration-tests
      14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-0000" "akka.tcp://sparkWorker@172.17.0.90:8888/user/Worker"
      14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
      14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
      14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
      14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
      14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
      14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
      14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
      14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
      14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
      14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
      14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
      14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
      14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
      14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
      14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
      14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
      14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
      14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
      14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
      14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077
      14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
      14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
      14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.87:7077...
      14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.88:7077...
      14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker ID
      14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
      

      Attachments

        1. log.txt
          51 kB
          Josh Rosen

        Issue Links

          Activity

            People

              andrewor14 Andrew Or
              joshrosen Josh Rosen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: