Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.9.1, 1.0.0
- Fix Version/s: None
- Component/s: None
Description
If a node is in a bad state, such that newly started executors fail on startup or first use, the Standalone Cluster Worker will happily keep spawning new ones. A better behavior would be for a Worker to mark itself as dead if it has had a history of continuously producing erroneous executors, or else to somehow prevent a driver from re-registering executors from the same machine repeatedly.
Reported on mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAL8t0BqJFgtf-Vbzjq6Yj7CKBL_9P9S0tRVEW2MVG6ZBNgxY2g@mail.gmail.com%3E
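The fix suggested above amounts to tracking executor failures per worker and refusing to launch more once a streak of failures indicates the node is bad. A minimal sketch of that idea (hypothetical names, not Spark's actual Worker code) might look like:

```python
# Hypothetical sketch of the proposed behavior (not Spark's actual Worker
# implementation): count consecutive executor failures and mark the worker
# unhealthy once a threshold is crossed, instead of respawning forever.

class WorkerHealth:
    """Marks a worker unhealthy after too many consecutive executor failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_executor_exit(self, exit_code):
        # Exit code 0 means the executor shut down cleanly; reset the streak.
        if exit_code == 0:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def is_healthy(self):
        # Stop scheduling new executors here once the streak hits the limit.
        return self.consecutive_failures < self.max_consecutive_failures


health = WorkerHealth(max_consecutive_failures=3)
for code in (53, 53, 53):  # repeated "Command exited with code 53", as in the logs
    health.record_executor_exit(code)
print(health.is_healthy())  # False: the worker should stop launching executors
```

A real implementation would also need a way to recover (e.g. resetting the streak after a timeout or a successful launch), otherwise a transient problem would permanently blacklist the worker.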
Relevant logs:
14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/4 removed: Command exited with code 53
14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it
14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 (already removed): Failed to create local directory (bad spark.local.dir?)
14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/27 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/27 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now RUNNING
14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 with 32.7 GB RAM
14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=wikistats_pd
14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=wikistats_pd
14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string projectcode, string pagename, i32 pageviews, i32 bytes}
14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, string, int, int] separator=[[B@29a81175] nullstring=\N lastColumnTakesRest=false
shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] with ID 27
show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 disconnected, so removing it
14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 (already removed): remote Akka client disassociated
14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/27 removed: Command exited with code 53
14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/28 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/28 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/28 is now RUNNING
tables;
Issue Links
- duplicates SPARK-6183 Skip bad workers when re-launching executors (Resolved)