[SPARK-33841] Jobs disappear intermittently from the SHS under high load - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0, 3.0.1, 3.1.0, 3.2.0
Fix Version/s: 3.0.2, 3.1.0, 3.2.0
Component/s: Spark Core
Labels: None
Environment: SHS is running locally on Ubuntu 19.04

Description

Ran into an issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.

The issue is caused by SPARK-29043, which was intended to improve the concurrent performance of the History Server. That change breaks the "app deletion" logic because it does not properly synchronize the handling of event log entries that are still being processed. Since the SHS now filters out all entries that are being processed, those entries are never updated with the new lastProcessed time. Any entry that completes processing right after the filtering and before the check for stale entries is therefore identified as stale and is removed from the UI until the next checkForLogs run. This happens because the stale check uses the updated lastProcessed time as its criterion, and entries that were not updated with the new time match it.
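The race described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the actual SHS code: the class `HistoryListing`, the callback `between_filter_and_stale_check`, and the flag `remember_skipped` are hypothetical names invented for illustration. The flag shows one possible guard (remembering which entries were skipped as "processing"), which is not necessarily what the linked pull requests do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class HistoryListing:
    """Toy stand-in for the SHS application listing (hypothetical, not Spark code)."""
    last_processed: Dict[str, int] = field(default_factory=dict)  # log path -> last scan time
    processing: Set[str] = field(default_factory=set)             # logs being parsed right now

    def check_for_logs(
        self,
        now: int,
        paths_on_storage: List[str],
        between_filter_and_stale_check: Optional[Callable[[], None]] = None,
        remember_skipped: bool = False,
    ) -> List[str]:
        skipped: Set[str] = set()
        for path in paths_on_storage:
            if path in self.processing:
                # Filtered out: lastProcessed is NOT refreshed for this entry.
                if remember_skipped:
                    skipped.add(path)
                continue
            self.last_processed[path] = now

        # Simulates a parsing task finishing in the gap between the two steps.
        if between_filter_and_stale_check is not None:
            between_filter_and_stale_check()

        # Stale check: anything not refreshed during this scan is dropped.
        stale = [
            path for path, seen in self.last_processed.items()
            if seen < now and path not in self.processing and path not in skipped
        ]
        for path in stale:
            del self.last_processed[path]
        return stale


def run(remember_skipped: bool):
    shs = HistoryListing()
    logs = ["app-1", "app-2"]
    shs.check_for_logs(1, logs)      # first scan: both apps are listed
    shs.processing.add("app-2")      # app-2 starts being re-parsed
    removed = shs.check_for_logs(
        2, logs,
        between_filter_and_stale_check=lambda: shs.processing.discard("app-2"),
        remember_skipped=remember_skipped,
    )
    return removed, sorted(shs.last_processed)


print(run(False))  # (['app-2'], ['app-1'])
print(run(True))   # ([], ['app-1', 'app-2'])
```

In the first run, `app-2` is still present on storage but is dropped from the listing, which matches the symptom of an application vanishing from the UI until a later scan adds it back. In the second run, the skipped entry is excluded from the stale check and stays listed.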

The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. In this case, around 800 copies (82.6 MB) of an event log file were created using the shs-monitor script. The SHS then behaved strangely when counting the total number of applications: at first the number increased as expected, but on the next page refresh it decreased. No errors were logged by the SHS.

Issue Links

links to

[Github] Pull Request #30842 (vladhlinsky)

[Github] Pull Request #30845 (vladhlinsky)

[Github] Pull Request #30847 (vladhlinsky)

People

Assignee: Vladislav Glinskiy

Reporter: Vladislav Glinskiy

Votes: 0

Watchers: 4

Dates

Created: 18/Dec/20 13:01

Updated: 18/Dec/20 23:20

Resolved: 18/Dec/20 21:27