Uploaded image for project: 'Sling'
  1. Sling
  2. SLING-7778

race-condition in scheduled job creation might keep scheduled job in-memory, and cause a duplicate scheduled job execution

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • Event 4.2.0
    • Event 4.2.12
    • Event
    • None

    Description

      case 1 : (extremely unlikely) observation faster than addOrUpdateJob can even finish

      The creation of a scheduled job is handled in JobSchedulerImpl.addScheduledJob which in turn calls scheduledJobHandler.addOrUpdateJob. There, the job is first persisted to the repository via calling writeScheduledJob.

      Normally the sequence of events is then as follows:

      • [A] the synchronized block in addOrUpdateJob is entered and the job is scheduled with quartz.
      • [B] then shortly afterwards, observation is triggered and handleAddOrUpdate is called. This will, as a side-effect add the scheduled job to the scheduledJobs map
      • the result of this being that the scheduled job is now in ScheduledJobHandler.scheduledJobs
      • later on, when the scheduled job is finished, handleRemove() (or similar) is called, and it checks if it finds the scheduled job in ScheduledJobHandler.scheduledJobs. if it does, all good. And here we're in the all good case, so all is fine

      However, there's the following race-condition:

      • first [B] happens, ie first the observation event is triggered, so handleAddOrUpdate is called, and again the side-effect hits: the scheduled job is added to the ScheduledJobHandler.scheduledJobs. all good so far
      • then [A] continues into the synchronized block. But now, the remove of the scheduled job actually does damage: the scheduled job ends up not being in the ScheduledJobHandler.scheduledJobs map.
      • so in this scenario, when a later handleRemove() (or similar) happens, it can't find the scheduled job (see here h being null)
      • the result being that JobScheduler.unscheduleJob is not invoked, leaving the scheduled job with the JobSchedulerImpl.

      And the result of this race-condition is that JobSchedulerImpl.scheduledJobs map has a zombie scheduled job, which it shouldn't actually have. Note this situation can be verified in the /system/console/slingevent webconsole under "Scheduled Jobs".

      So later on for example, a topology event happens, such as a mere PROPERTIES_CHANGED_EVENT. That in turn will trigger JobSchedulerImpl.configurationChanged() which will now re-schedule the previously scheduled job. And we have a duplicate scheduled job execution.

      case 2: (likely) scheduled job finishes before observation event triggered

      Besides case 1, there's also the more likely case 2: where a scheduled job finishes (thus calls ScheduledJobHandler.remove()) before the observation event for ADDED was ever called (leading to ScheduledJobHandler.handleAddOrUpdate()).

      In this case, the remove() fails to delete the scheduled job from the (parent) JobSchedulerImpl, as it doesn't have it in the scheduledJob list yet.

      Which results - same as case 1 - in an inconsistency leading to a stale ScheduledJobInfoImpl in JobSchedulerImpl.scheduledJobs map - which when a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job execution.

      Note that this is only likely to happen if the schedule is set for 'very soon now', eg at(NOW+1000) or something along those lines. If the schedule is couple minutes in the future, the likelyhood of observation not triggering before drops towards zero.

      So overall, this scenario is likely to happen only with such 'immediate schedules' and could be mitigated by increasing the delay to eg 1 or a couple of minutes.

      Attachments

        1. SLING-7778.patch
          4 kB
          Stefan Egli

        Issue Links

          Activity

            People

              stefanegli Stefan Egli
              stefanegli Stefan Egli
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: