[SPARK-21991] [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine has very high load - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.2, 2.1.0, 2.1.1, 2.2.0
Fix Version/s: 2.0.3, 2.1.3, 2.2.1, 2.3.0
Component/s: Spark Submit
Labels: None
Environment:

Single node machine running Ubuntu 16.04.2 LTS (4.4.0-79-generic)
YARN 2.7.2
Spark 2.0.2

Description

The way the LauncherServer acceptConnections thread schedules client timeouts can cause the thread to die, non-deterministically, with the following exception when the machine is under very high load:

Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task already scheduled or cancelled
        at java.util.Timer.sched(Timer.java:401)
        at java.util.Timer.schedule(Timer.java:193)
        at org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249)
        at org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80)
        at org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143)

The issue is related to the ordering of actions that the acceptConnections thread uses to handle a client connection:

1. create timeout action
2. create client thread
3. start client thread
4. schedule timeout action

Under normal conditions the timeout action is scheduled before the client thread has a chance to start. If the machine is under very high load, however, the client thread can receive CPU time before the timeout action gets scheduled.

When this happens, the client thread cancels the timeout action (which has not yet been scheduled) and carries on. As soon as the acceptConnections thread gets the CPU back, it tries to schedule the timeout action, which has already been cancelled, and the exception is raised.
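The underlying java.util.Timer behaviour can be shown in isolation. The following is a minimal sketch, not Spark code: the class name CancelledTaskDemo and the method scheduleAfterCancel are hypothetical, and the 10 second delay is an arbitrary placeholder. It cancels a TimerTask that was never scheduled and then tries to schedule it, which is the sequence the two threads produce when the client thread wins the race.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical demo class; it is not part of Spark.
class CancelledTaskDemo {

    /**
     * Cancels a never-scheduled task and then tries to schedule it.
     * Returns the exception message, or null if scheduling succeeded.
     */
    static String scheduleAfterCancel() {
        Timer timer = new Timer("demo-timer", true); // daemon thread, so the JVM can exit
        TimerTask timeout = new TimerTask() {
            @Override
            public void run() {
                // In the real server this would close the idle client connection.
            }
        };
        try {
            timeout.cancel();                  // what the client thread does first
            timer.schedule(timeout, 10_000L);  // what acceptConnections does afterwards
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleAfterCancel());
    }
}
```

Because the exception is thrown in the thread that calls schedule, in the server it surfaces in the acceptConnections thread rather than in the timer thread, which matches the stack trace above.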

Changing the order in which the client thread gets started and the timeout gets scheduled seems to be sufficient to fix this issue.
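The two orderings can be compared in a simplified model. This is a sketch under stated assumptions, not the actual patch: AcceptOrderingSketch, handleClient, and TIMEOUT_MS are hypothetical names, the client thread does nothing except cancel its timeout, and the unlucky interleaving is forced by joining the client thread before scheduling instead of relying on machine load.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical model of the accept loop; names do not match Spark's source.
class AcceptOrderingSketch {

    private static final long TIMEOUT_MS = 10_000L; // arbitrary placeholder value

    /** Handles one simulated client; returns true if the timeout was scheduled without error. */
    static boolean handleClient(Timer timer, boolean scheduleBeforeStart) {
        TimerTask timeout = new TimerTask() {
            @Override
            public void run() {
                // Would close a client that never completed its handshake.
            }
        };
        // The simulated client immediately cancels its timeout, as a healthy client would.
        Thread client = new Thread(timeout::cancel, "client");

        boolean scheduled;
        try {
            if (scheduleBeforeStart) {
                timer.schedule(timeout, TIMEOUT_MS); // reordered: schedule first
                client.start();
            } else {
                client.start();
                join(client);                        // force the client to run first
                timer.schedule(timeout, TIMEOUT_MS); // task is already cancelled, so this throws
            }
            scheduled = true;
        } catch (IllegalStateException e) {
            scheduled = false;
        }
        join(client);
        return scheduled;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Timer timer = new Timer("timeouts", true);
        System.out.println("start first:    " + handleClient(timer, false));
        System.out.println("schedule first: " + handleClient(timer, true));
        timer.cancel();
    }
}
```

With the timeout scheduled before the client thread starts, a cancel from the client always acts on an already scheduled task, so the later schedule call that triggered the exception no longer exists.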

As stated above, the issue is non-deterministic. I hit it multiple times on a single-node machine while submitting a high number of short jobs sequentially, but I couldn't easily create a test that reproduces it.

Issue Links

links to

[Github] Pull Request #19217 (nivox)

[Github] Pull Request #19574 (ash211)

People

Assignee: Andrea Zito
Reporter: Andrea Zito
Shepherd: Marcelo Masiero Vanzin
Votes: 1
Watchers: 4

Dates

Created: 13/Sep/17 09:02
Updated: 25/Oct/17 18:04
Resolved: 25/Oct/17 17:12