Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.6.1
- Fix Version/s: None
Description
Some of our Spark users complain that their applications do not show up in the history server UI. Our analysis suggests that this is a side effect of some applications' event logs being too big. This is especially true for Spark ML applications, which may run many iterations, but it applies to other kinds of Spark jobs too. For example, on my local machine, just running the following generates an event log of about 80 MB:
./spark-shell --master yarn --deploy-mode client --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events

val words = sc.textFile("test.txt")
for (i <- 1 to 10000) words.count
sc.stop()
For one of our users this file was as big as 12 GB; he was running logistic regression using Spark ML. Given that each application generates its own event log, and event logs are processed serially in a single thread, one huge application can prevent a lot of users from seeing their applications on the main UI. To overcome this issue I propose making the replay execution multi-threaded, so that a single large event log won't block other applications from being rendered in the UI (a minimal sketch follows below). This still cannot solve the issue completely if there are too many large event logs, but the alternatives I have considered (reading chunks from the beginning and end of the log to get the application start and end events, or modifying the event log format so it carries this info in a header or footer) are all more intrusive.
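A minimal sketch of the threaded-replay idea. All the names here (ReplayPoolSketch, replayLog, numReplayThreads, the config name in the comment) are hypothetical illustrations; the real change would go into the history provider's log-checking loop:

import java.util.concurrent.Executors
import scala.util.control.NonFatal

object ReplayPoolSketch {
  // Hypothetical knob; a real patch would read this from a config
  // (an assumed name like spark.history.fs.numReplayThreads).
  private val numReplayThreads = 4
  private val replayPool = Executors.newFixedThreadPool(numReplayThreads)

  // Stand-in for the real replay logic that parses one event log and
  // registers the application in the listing (hypothetical helper).
  private def replayLog(path: String): Unit = {
    println(s"replaying $path")
  }

  // Instead of replaying logs one after another on a single thread,
  // submit one task per log: a single 12 GB log then ties up only one
  // worker while the other threads keep rendering small applications.
  def checkForLogs(logPaths: Seq[String]): Unit = {
    logPaths.foreach { path =>
      replayPool.submit(new Runnable {
        override def run(): Unit = {
          try replayLog(path)
          catch { case NonFatal(e) => e.printStackTrace() } // don't kill the pool
        }
      })
    }
  }
}

A fixed-size pool also bounds the number of event log streams open at once; the pool size would need to be configurable.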
In addition, there are several other things we can do to improve the history server implementation:
- During the log-checking phase, to identify the application start and end times, the replaying thread processes the whole event log and throws away all the info apart from the application start and end events. This is a huge waste, given that as soon as a user clicks on the application we reprocess the same event log to get the job/task details. We should either optimize the first level of parsing so it reads some chunks from the beginning and end of the file to identify the application-level details (see the first sketch after this list), or better yet, cache the job/task-level details when we process the file for the first time.
- On the job details page there is no pagination, and we only show the last 1000 job events when there are more than 1000. Granted, users with more than 1000 jobs probably won't page through them, but not even having that option is a bad experience. Also, if that page were paginated, we could process the event log lazily, deferring the rest of it until the user wants to view the next page. This can help in cases where processing really large files causes OOM issues, as we would only be processing a subset of the file.
- On startup, the history server reprocesses the whole event log directory. For the top-level application details, we could persist the processing results from the last run in a more compact and searchable format to improve the bootstrap time (see the second sketch after this list). This is briefly mentioned in SPARK-6951.
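Here is a rough sketch of the chunk-reading idea from the first bullet. Everything in it (HeadTailScanSketch, readChunk, the 64 KB chunk size) is a hypothetical illustration, not the actual history server code; it relies only on the fact that event logs are one JSON object per line, with the start event near the top of the file and the end event near EOF:

import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

object HeadTailScanSketch {

  // Read up to `n` bytes starting at `offset`. The sketch uses a local
  // RandomAccessFile to stay self-contained; the real provider would
  // seek on an HDFS input stream the same way.
  private def readChunk(raf: RandomAccessFile, offset: Long, n: Int): String = {
    val buf = new Array[Byte](n)
    raf.seek(offset)
    val read = raf.read(buf)
    new String(buf, 0, math.max(read, 0), StandardCharsets.UTF_8)
  }

  // Scan only the first and last `chunkSize` bytes of the event log
  // for the application start/end events instead of replaying it all.
  def scanAppEvents(path: String, chunkSize: Int = 64 * 1024): (Option[String], Option[String]) = {
    val raf = new RandomAccessFile(path, "r")
    try {
      val head = readChunk(raf, 0, chunkSize)
      val tail = readChunk(raf, math.max(raf.length() - chunkSize, 0L), chunkSize)
      val start = head.split("\n").find(_.contains("SparkListenerApplicationStart"))
      // Search the tail from the back; a partial first line in the
      // tail chunk is harmless.
      val end = tail.split("\n").reverseIterator.find(_.contains("SparkListenerApplicationEnd"))
      (start, end)
    } finally {
      raf.close()
    }
  }
}

One caveat: if the application is still running or was killed, the end event may be missing, so the scan has to tolerate an absent SparkListenerApplicationEnd (hence the Option).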
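And a rough sketch of the persisted-listing idea from the last bullet, again with hypothetical names (AppSummary, ListingCacheSketch) and a deliberately simple tab-separated format; a real implementation would want a more robust, searchable store:

import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical summary record: just enough to render the main listing.
case class AppSummary(id: String, name: String, startTime: Long, endTime: Long)

object ListingCacheSketch {
  // Persist one tab-separated record per application; a restart can
  // rebuild the listing from this file without replaying any event log.
  def save(cache: File, apps: Seq[AppSummary]): Unit = {
    val out = new PrintWriter(cache)
    try apps.foreach(a => out.println(s"${a.id}\t${a.name}\t${a.startTime}\t${a.endTime}"))
    finally out.close()
  }

  def load(cache: File): Seq[AppSummary] =
    if (!cache.exists()) Seq.empty
    else {
      val src = Source.fromFile(cache)
      try src.getLines().map { line =>
        val Array(id, name, start, end) = line.split("\t", 4)
        AppSummary(id, name, start.toLong, end.toLong)
      }.toList
      finally src.close()
    }
}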
Issue Links
- is related to: SPARK-6951 History server slow startup if the event log directory is large (Resolved)