[SPARK-15857] Add Caller Context in Spark - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Hadoop has implemented a log-tracing feature called caller context (Jira: HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand how specific applications impact parts of the Hadoop system and what problems they may be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation it is very helpful to track which upper-level job issued it. The upper-level callers may be specific Oozie tasks, MR jobs, Hive queries, or Spark jobs.

Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. Those systems invoke the HDFS client API and the Yarn client API to set up the caller context, and also expose an API through which upstream callers can pass in a caller context.

Many Spark applications run on Yarn/HDFS. Spark can likewise implement its own caller context by invoking the HDFS/Yarn API, and expose an API that lets upstream applications set up their caller contexts. As a result, the Spark caller context written into the Yarn log / HDFS log can be associated with the task id, stage id, job id and app id. This also helps Spark users identify tasks, especially if Spark supports multi-tenant environments in the future.
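To make the idea concrete, here is a minimal sketch of how such a context string could be assembled from the app id, job id, stage id and task id. The class name `CallerContextSketch`, the method `build`, and the `SPARK_AppId_..._JobId_...` layout are all hypothetical and chosen only for illustration; they are not taken from this issue or from the linked pull request. The sketch only builds the string; handing it to Hadoop (presumably through the caller-context API added in HDFS-9184) is left out so the example has no external dependencies.

```java
import java.util.StringJoiner;

// Hypothetical helper: illustrates one possible caller-context layout,
// not the format Spark actually writes.
final class CallerContextSketch {
    private CallerContextSketch() {}

    /**
     * Joins whichever identifiers are known into a single context string.
     * A null argument means "not known at this call site" (for example,
     * the driver has an app id but no task id) and is skipped.
     */
    static String build(String appId, Integer jobId, Integer stageId, Long taskId) {
        // Prefix marks the caller as Spark; fields are separated by "_".
        StringJoiner joiner = new StringJoiner("_", "SPARK_", "");
        addIfPresent(joiner, "AppId", appId);
        addIfPresent(joiner, "JobId", jobId);
        addIfPresent(joiner, "StageId", stageId);
        addIfPresent(joiner, "TaskId", taskId);
        return joiner.toString();
    }

    private static void addIfPresent(StringJoiner joiner, String label, Object value) {
        if (value != null) {
            joiner.add(label + "_" + value);
        }
    }

    public static void main(String[] args) {
        // Task-side context: all four identifiers are available.
        System.out.println(build("application_1465000000000_0001", 3, 7, 42L));
        // Driver-side context: only the application id is known.
        System.out.println(build("application_1465000000000_0001", null, null, null));
    }
}
```

Skipping unknown fields rather than emitting placeholders keeps the string short, which matters if the audit log limits the length of the context it records.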

Issue Links

breaks

SPARK-17710 ReplSuite fails with ClassCircularityError in master Maven builds

Resolved

is blocked by

HDFS-9184 Logging HDFS operation's caller context into audit logs

Resolved

HADOOP-13527 Add Spark to CallerContext LimitedPrivate scope

Resolved

is related to

SPARK-17714 ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName

Resolved

FLINK-16809 Support setting CallerContext on YARN deployments

Open

links to

[Github] Pull Request #14312 (Sherry302)

design doc

Sub-Tasks

1.	Set up caller context to HDFS and Yarn	Resolved	Weiqing Yang
2.	Set up caller context to YARN	Resolved	Unassigned
3.	Spark expose an API to pass in Caller Context into it	Resolved	Weiqing Yang
4.	Pass 'jobId' to Task	Closed	Unassigned

People

Assignee: Unassigned

Reporter: Weiqing Yang

Votes: 0

Watchers: 13

Dates

Created: 10/Jun/16 00:37

Updated: 26/Mar/20 12:40

Resolved: 13/Feb/17 21:28