[SLING-7778] race-condition in scheduled job creation might keep scheduled job in-memory, and cause a duplicate scheduled job execution - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Event 4.2.0
Fix Version/s: Event 4.2.12
Component/s: Event
Labels: None

Description

Case 1 (extremely unlikely): observation fires before addOrUpdateJob can even finish

The creation of a scheduled job is handled in JobSchedulerImpl.addScheduledJob, which in turn calls scheduledJobHandler.addOrUpdateJob. There, the job is first persisted to the repository by calling writeScheduledJob.

Normally the sequence of events is then as follows:

[A] The synchronized block in addOrUpdateJob is entered and the job is scheduled with Quartz.
[B] Shortly afterwards, observation is triggered and handleAddOrUpdate is called. As a side effect, this adds the scheduled job to the scheduledJobs map.
As a result, the scheduled job is now in ScheduledJobHandler.scheduledJobs.
Later on, when the scheduled job is finished, handleRemove() (or similar) is called and checks whether it finds the scheduled job in ScheduledJobHandler.scheduledJobs. If it does, all is fine, and that is the case here.

However, there is the following race condition:

First [B] happens: the observation event is triggered first, so handleAddOrUpdate is called and, as its side effect, adds the scheduled job to ScheduledJobHandler.scheduledJobs. All good so far.
Then [A] continues into the synchronized block. Now the removal of the scheduled job there actually does damage: the scheduled job ends up not being in the ScheduledJobHandler.scheduledJobs map.
So in this scenario, when a later handleRemove() (or similar) happens, it cannot find the scheduled job (h being null at that point).
As a result, JobScheduler.unscheduleJob is not invoked, leaving the scheduled job registered with JobSchedulerImpl.

The result of this race condition is that the JobSchedulerImpl.scheduledJobs map holds a zombie scheduled job that it should no longer have. This situation can be verified in the /system/console/slingevent web console under "Scheduled Jobs".

Later on, a topology event happens, for example a mere PROPERTIES_CHANGED_EVENT. That in turn triggers JobSchedulerImpl.configurationChanged(), which re-schedules the previously scheduled job. The result is a duplicate scheduled job execution.
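The two orderings of case 1 can be replayed in a deliberately simplified model. This is a sketch, not the Sling code: the class Case1RaceSketch, its two sets and all method bodies are hypothetical stand-ins that only mirror the steps described above (the removal inside the synchronized block, the side effect of handleAddOrUpdate, and the lookup in handleRemove).

```java
import java.util.HashSet;
import java.util.Set;

public class Case1RaceSketch {
    // Models ScheduledJobHandler.scheduledJobs (assumption: reduced to job names).
    private final Set<String> handlerJobs = new HashSet<>();
    // Models JobSchedulerImpl.scheduledJobs (assumption: reduced to job names).
    private final Set<String> schedulerJobs = new HashSet<>();

    // [A] the synchronized block of addOrUpdateJob: removes the handler entry,
    // then schedules the job.
    void stepA(String name) {
        handlerJobs.remove(name);
        schedulerJobs.add(name);
    }

    // [B] observation callback: side effect of adding to the handler map.
    void stepB(String name) {
        handlerJobs.add(name);
    }

    // Job finished: unschedule only if the handler still knows the job.
    void handleRemove(String name) {
        if (handlerJobs.remove(name)) {
            schedulerJobs.remove(name);
        }
    }

    // Topology change: everything the scheduler still holds gets re-scheduled.
    int configurationChanged() {
        return schedulerJobs.size();
    }

    // Returns how many already-finished jobs would be re-scheduled.
    static int rescheduledAfterCompletion(boolean observationFirst) {
        Case1RaceSketch m = new Case1RaceSketch();
        if (observationFirst) {
            m.stepB("job");
            m.stepA("job");
        } else {
            m.stepA("job");
            m.stepB("job");
        }
        m.handleRemove("job");
        return m.configurationChanged();
    }

    public static void main(String[] args) {
        System.out.println("normal order: " + rescheduledAfterCompletion(false)); // 0
        System.out.println("race order:   " + rescheduledAfterCompletion(true));  // 1
    }
}
```

In the race order the model ends with one entry left on the scheduler side, which corresponds to the zombie that a later topology event would re-schedule.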

Case 2 (likely): scheduled job finishes before the observation event is triggered

Besides case 1, there is also the more likely case 2, where a scheduled job finishes (and thus calls ScheduledJobHandler.remove()) before the observation event for ADDED has been delivered (which would lead to ScheduledJobHandler.handleAddOrUpdate()).

In this case, remove() fails to delete the scheduled job from the (parent) JobSchedulerImpl, as the handler does not have it in its scheduledJobs map yet.

This results, same as in case 1, in an inconsistency: a stale ScheduledJobInfoImpl remains in the JobSchedulerImpl.scheduledJobs map, which leads to duplicate scheduled job execution when a PROPERTIES_CHANGED_EVENT occurs.

Note that this is only likely to happen if the schedule is set for 'very soon now', e.g. at(NOW+1000) or something along those lines. If the schedule is a couple of minutes in the future, the likelihood of observation not having triggered by then drops towards zero.
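The dependence of case 2 on timing can be illustrated with a small model that compares the schedule delay against the observation latency. All names here (Case2TimingSketch, staleEntryRemains, the two sets) are hypothetical, the latency values are made up for illustration, and the job's own run time is assumed to be zero.

```java
import java.util.HashSet;
import java.util.Set;

public class Case2TimingSketch {
    // Models ScheduledJobHandler.scheduledJobs (assumption: job names only).
    private final Set<String> handlerJobs = new HashSet<>();
    // Models JobSchedulerImpl.scheduledJobs (assumption: job names only).
    private final Set<String> schedulerJobs = new HashSet<>();

    void schedule(String name) {
        schedulerJobs.add(name);
    }

    // Observation event for ADDED arrives.
    void observationAdded(String name) {
        handlerJobs.add(name);
    }

    // Job finished: the scheduler entry is only removed if the handler knows the job.
    void jobFinished(String name) {
        if (handlerJobs.remove(name)) {
            schedulerJobs.remove(name);
        }
    }

    // True if a stale scheduler entry survives the job's completion.
    static boolean staleEntryRemains(long scheduleDelayMs, long observationLatencyMs) {
        Case2TimingSketch m = new Case2TimingSketch();
        m.schedule("job");
        if (scheduleDelayMs < observationLatencyMs) {
            // The job fires and finishes before observation has delivered ADDED.
            m.jobFinished("job");
            m.observationAdded("job");
        } else {
            m.observationAdded("job");
            m.jobFinished("job");
        }
        return m.schedulerJobs.contains("job");
    }

    public static void main(String[] args) {
        System.out.println(staleEntryRemains(1_000, 3_000));   // true
        System.out.println(staleEntryRemains(120_000, 3_000)); // false
    }
}
```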

So overall, this scenario is likely to happen only with such 'immediate schedules' and could be mitigated by increasing the delay to, e.g., one or a couple of minutes.
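As a rough sketch of that mitigation on the caller side, the requested time can be clamped to a minimum delay before it is handed to the scheduling call. The helper notBefore and the one-minute MIN_DELAY are assumptions for illustration and are not part of Sling.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Date;

public class DelayedScheduleSketch {
    // Assumed minimum lead time; the report suggests one or a couple of minutes.
    static final Duration MIN_DELAY = Duration.ofMinutes(1);

    // Returns the requested time, but never earlier than now + minDelay.
    static Date notBefore(Instant requested, Instant now, Duration minDelay) {
        Instant earliest = now.plus(minDelay);
        return Date.from(requested.isBefore(earliest) ? earliest : requested);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // An 'immediate schedule' of NOW+1000 ms is pushed out to NOW + 1 minute.
        Date when = notBefore(now.plusMillis(1000), now, MIN_DELAY);
        // 'when' would then be used as the argument of the at(...) call
        // mentioned in the description.
        System.out.println(when);
    }
}
```

This only makes the window less likely to be hit; it does not remove the underlying inconsistency between the two maps.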

Attachments

SLING-7778.patch, 4 kB, 16/Jul/18 11:58, Stefan Egli

Issue Links

is duplicated by: SLING-7762 Sling jobs scheduled by date do not disappear from scheduledJobs map (Closed)

is related to: SLING-4680 Decouple scheduled jobs from observation (Closed)

People

Assignee: Stefan Egli

Reporter: Stefan Egli

Votes: 0

Watchers: 2

Dates

Created: 12/Jul/18 15:31

Updated: 26/Sep/19 12:39

Resolved: 17/Jul/18 09:33