Details
Description
Problem:
HistoryFileManager.addIfAbsent produces a large amount of log output when the
number of cached entries younger than mapreduce.jobhistory.max-age-ms far
exceeds mapreduce.jobhistory.joblist.cache.size.
Example:
If the cache contains 50000 entries in total, 10000 of them newer than
mapreduce.jobhistory.max-age-ms, and mapreduce.jobhistory.joblist.cache.size is
20000, the HistoryFileManager.addIfAbsent method produces 50000 - 20000 = 30000
lines of the message "Waiting to remove <key> from JobListCache because it is
not in done yet".
A stack trace will be attached.
Impact:
In addition to consuming a large amount of disk space, this issue blocks
JobHistory.getJob for a long time and slows job execution down significantly,
because getJob is called by RPCs such as
HistoryClientService.HSClientProtocolHandler.getJobReport.
The blocking occurs because HistoryFileManager.UserLogDir.scanIfNeeded
eventually calls HistoryFileManager.addIfAbsent inside a synchronized block. When
multiple threads call scanIfNeeded simultaneously, one of them acquires the lock
and the others are blocked until the first thread completes its long-running
HistoryFileManager.addIfAbsent call.
Solution:
- Reduce the amount of logging so that HistoryFileManager.addIfAbsent doesn't take too long.
- Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
scanning if another thread is already scanning. This changes the semantics of
some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
because scanIfNeeded may keep outdated state.
- Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are
not blocked by a loop over tens of thousands of entries.
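The second item (skip scanning when another thread is already scanning) could look roughly like the sketch below. This is not Hadoop code: the class SkippingScanner, its fields, and the use of ReentrantLock.tryLock are assumptions chosen for illustration, and the real scanIfNeeded has additional conditions that are omitted here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a per-user directory scanner; not Hadoop code.
class SkippingScanner {
    private final ReentrantLock scanLock = new ReentrantLock();
    private final Runnable scanBody;              // the (possibly slow) scan
    final AtomicInteger scans = new AtomicInteger(); // number of scans actually run

    SkippingScanner(Runnable scanBody) {
        this.scanBody = scanBody;
    }

    /** Returns true if this call ran the scan, false if it was skipped. */
    boolean scanIfNeeded() {
        // tryLock() returns immediately instead of blocking the caller.
        if (!scanLock.tryLock()) {
            // Another thread is scanning; the caller proceeds with
            // whatever state is currently cached, which may be stale.
            return false;
        }
        try {
            scanBody.run();
            scans.incrementAndGet();
            return true;
        } finally {
            scanLock.unlock();
        }
    }
}
```

A caller that gets false back does not wait for the scan in progress, which is exactly the semantic change noted above: methods built on top of it may return state that predates the scan.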
This patch implemented the first item.
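One possible way to reduce the log volume is to count the entries that cannot be evicted during a pass and emit a single summary line. The sketch below is an assumption made for illustration, not the content of the actual patch: the class AggregatedEvictionLog, its Entry type, and the summary message wording are all hypothetical, and the cache is a simplified model rather than the real JobListCache.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical, simplified model of a size-bounded job cache; not Hadoop code.
class AggregatedEvictionLog {
    // Hypothetical entry: records only whether the move to "done" is pending.
    static final class Entry {
        final boolean movePending;
        Entry(boolean movePending) { this.movePending = movePending; }
    }

    final ConcurrentSkipListMap<Integer, Entry> cache = new ConcurrentSkipListMap<>();
    final List<String> log = new ArrayList<>(); // stands in for a real logger
    final int maxSize;

    AggregatedEvictionLog(int maxSize) { this.maxSize = maxSize; }

    void addIfAbsent(int jobId, Entry entry) {
        // Nothing to evict if the key already existed or the cache is within bounds.
        if (cache.putIfAbsent(jobId, entry) != null || cache.size() <= maxSize) {
            return;
        }
        int skipped = 0;
        Iterator<Integer> keys = cache.navigableKeySet().iterator();
        // Walk keys in ascending order until the cache fits or keys run out.
        while (cache.size() > maxSize && keys.hasNext()) {
            Integer key = keys.next();
            Entry e = cache.get(key);
            if (e == null) {
                continue;
            }
            if (e.movePending) {
                skipped++;         // cannot evict yet: count instead of logging
            } else {
                cache.remove(key); // safe to evict
            }
        }
        if (skipped > 0) {
            // One summary line per pass instead of one line per entry.
            log.add("Waiting to remove " + skipped
                + " entries from the cache because they are not in done yet");
        }
    }
}
```

With this shape, each call to addIfAbsent writes at most one line no matter how many entries are still pending, so the time spent inside the synchronized caller no longer grows with the number of log lines.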
Attachments
Issue Links
- is related to
  - MAPREDUCE-6684 High contention on scanning of user directory under immediate_done in Job History Server (Resolved)
  - MAPREDUCE-6573 Reduce the time of calling scanIntermediateDirectory in getFileInfo (Open)