Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 0.7.0
- None
Description
Spark logs are currently output to the console or to a single Spark log file, which is inconvenient when analyzing a single job. We would like to implement a JobLogger for Spark that outputs one history file per job (ActiveJob). Spark now has task metrics and summaries, so the history file can be built on top of them.
The job history contains:
1. Additional information from outside, for example the query plan from Shark.
2. The RDD graph for the job.
3. Task start/stop and shuffle information.
4. Stage information.
A new class named JobLogger does this job:
1. Each SparkContext has one JobLogger, and one folder is created for every JobLogger.
2. The JobLogger manages the history files of all active jobs running in that SparkContext, creating one history file per ActiveJob; the file name is the job ID.
3. The JobLogger generates the job history and writes it to the history file.
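The responsibilities above could be sketched roughly as follows. This is a minimal, Spark-free sketch: the class name JobLogger comes from this proposal, but the method names, field names, and default log directory here are assumptions for illustration, not Spark's actual API.

```scala
import java.io.{File, PrintWriter}

// Sketch: one JobLogger per SparkContext, one folder per JobLogger,
// one history file per active job, named by its job ID.
class JobLogger(contextId: String, baseDir: String = "/tmp/spark-job-logs") {
  private val logDir = new File(baseDir, contextId)
  logDir.mkdirs()  // one folder is created for every JobLogger

  // one open history file per active job, keyed by job ID
  private val jobIdToWriter = scala.collection.mutable.Map[Int, PrintWriter]()

  def jobStarted(jobId: Int): Unit = {
    // the history file's name is the job ID
    jobIdToWriter(jobId) = new PrintWriter(new File(logDir, jobId.toString))
  }

  def logEvent(jobId: Int, event: String): Unit =
    jobIdToWriter.get(jobId).foreach { w => w.println(event); w.flush() }

  def jobFinished(jobId: Int): Unit =
    jobIdToWriter.remove(jobId).foreach(_.close())
}
```

A caller would create the logger once per context, then bracket each job with jobStarted/jobFinished and stream events in between.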
Job history generation:
1. Additional information from outside.
For example, to get the query plan from Shark, the interface between Shark and Spark would be modified so that Shark can pass this information to Spark.
2. Record the RDD graph for each job.
The RDD graph is printed using a top-down approach: RDD dependencies are output recursively starting from the final RDD, and the parent-child relationship is represented by indentation.
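The recursive, indentation-based dump described above can be illustrated with a small sketch. SimpleRDD here is a hypothetical stand-in for Spark's RDD and its dependency list, used only to show the recursion; it is not Spark's actual class.

```scala
// Stand-in for an RDD node and its parent dependencies.
case class SimpleRDD(id: Int, name: String, parents: Seq[SimpleRDD])

// Print the graph top-down from the final RDD; each level of
// indentation marks one parent-child step in the dependency chain.
def rddGraphToString(rdd: SimpleRDD, indent: Int = 0): String = {
  val line = " " * indent + s"RDD_${rdd.id} (${rdd.name})"
  (line +: rdd.parents.map(p => rddGraphToString(p, indent + 2))).mkString("\n")
}
```

For a two-node lineage such as a map over a hadoopFile, the final (map) RDD is printed first and its parent appears indented beneath it.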
3. Task start/stop and shuffle information.
This can be obtained from TaskMetrics and the TaskSetManager.
4. Stage information.
This can be obtained from StageInfo and the DAGScheduler.
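As a sketch of what per-task and per-stage entries in a history file might look like: the record fields below are modeled loosely on the kind of data TaskMetrics and StageInfo expose, but the field names and log-line format are assumptions for illustration only.

```scala
// Hypothetical per-task entry: timing plus shuffle counters.
case class TaskRecord(taskId: Long, stageId: Int,
                      launchTime: Long, finishTime: Long,
                      shuffleBytesWritten: Long, shuffleBytesRead: Long) {
  // one line per finished task in the job's history file
  def toLogLine: String =
    s"TASK_ID=$taskId STAGE_ID=$stageId DURATION=${finishTime - launchTime} " +
    s"SHUFFLE_WRITE=$shuffleBytesWritten SHUFFLE_READ=$shuffleBytesRead"
}

// Hypothetical per-stage entry: task count plus wall-clock duration.
case class StageRecord(stageId: Int, numTasks: Int,
                       submissionTime: Long, completionTime: Long) {
  // one line per completed stage
  def toLogLine: String =
    s"STAGE_ID=$stageId TASKS=$numTasks DURATION=${completionTime - submissionTime}"
}
```

Keeping each entry on one line with key=value pairs makes the history file easy to grep and parse when analyzing a single job after the fact.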