Details
Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Description
In Hive on Spark (HoS), after a RemoteDriver is launched it may fail to initialize a Spark context, in which case the ApplicationMaster eventually dies. In this scenario there are two issues related to RemoteSparkJobStatus::getAppID():
1. Currently we call getAppID() before starting the monitoring job. The getAppID() call waits up to hive.spark.client.future.timeout, and the monitoring job then waits up to hive.spark.job.monitor.timeout. The error message for the latter reports hive.spark.job.monitor.timeout as the total time spent waiting for job submission. This is inaccurate, because it does not include the hive.spark.client.future.timeout already spent in getAppID().
2. If the RemoteDriver dies suddenly, we may still wait out the full timeouts to no avail. This could be avoided by failing fast once we know the channel between the client and the remote driver has closed.
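The fail-fast idea in point 2 can be sketched with standard java.util.concurrent primitives. This is a hypothetical illustration, not the actual Hive client API: `getAppId`, the `appIdFuture`, and the `channelClosed` signal are all assumed names. The point is to race the app-ID future against a channel-closed signal, so a dead RemoteDriver surfaces immediately instead of after the full timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class ChannelAwareWait {
    /**
     * Hypothetical sketch: wait for the app ID, but fail fast if the
     * channel to the RemoteDriver closes first, rather than blocking
     * for the full hive.spark.client.future.timeout.
     */
    static String getAppId(CompletableFuture<String> appIdFuture,
                           CompletableFuture<Void> channelClosed,
                           long timeoutMs) throws Exception {
        // Whichever completes first wins: the app ID arriving, or the
        // channel-closed signal (turned into an exceptional completion).
        CompletableFuture<Object> either = CompletableFuture.anyOf(
                appIdFuture,
                channelClosed.thenApply(v -> {
                    throw new CompletionException(
                            new IllegalStateException("RemoteDriver channel closed"));
                }));
        return (String) either.get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> appId = new CompletableFuture<>();
        CompletableFuture<Void> closed = new CompletableFuture<>();
        // Simulate the driver dying before the app ID ever arrives.
        closed.complete(null);
        try {
            getAppId(appId, closed, 60_000);
        } catch (ExecutionException e) {
            // Returns immediately, without waiting out the 60s timeout.
            System.out.println("failed fast: " + e.getCause().getMessage());
        }
    }
}
```

In the real client the channel-closed signal would come from the RPC layer (e.g. a Netty channel-inactive callback) completing the signal future, which then short-circuits both the getAppID() wait and the monitor wait.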