Sling / SLING-5421

Allow JCR installer to recover from being paused indefinitely

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: JCR Installer 3.1.8
    • Fix Version/s: None
    • Component/s: Installer
    • Labels:
      None

      Description

      With SLING-3747 the JCR installer provided a mechanism for pausing the installer to support cases where installation can result in restart of installer bundle itself.

      However, the process may be killed abruptly after this flag is set, in which case the flag remains set. The installer then stays paused until a user removes the flag manually. To support such cases there should be some kind of timeout, so that the installer does not remain paused forever.

      Note also the discussion on the mailing list.


          Activity

          jsedding Julian Sedding added a comment -

          FYI: In the following I will call nodes created under /system/sling/installer/jcr/pauseInstallation "pause-marker".

          Another possible reason for an orphaned pause-marker could be that an InterruptedException is thrown, which causes the NIO resources of an Oak repository to be closed prematurely. This would prevent any further writes to the repository and thus not allow a client to clean up its pause-marker.

          Two points to note:

          • We need to record the pausing of the JCR installer in the repository, so other instances in a cluster can see it and also pause their installers.
          • We need to be able to recognize an orphaned pause-marker and automatically remove it at some point (e.g. during startup).
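The second point could be sketched roughly as follows. This is only an illustration, not the installer's actual code: it assumes each pause-marker records a creation time and that a marker older than some maximum age can be treated as orphaned. The class name `OrphanedPauseMarkers`, the method `findOrphaned` and the `maxAge` parameter are all hypothetical, and plain Java collections stand in for repository access.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: decides which pause-markers look orphaned based on their age. */
final class OrphanedPauseMarkers {

    private OrphanedPauseMarkers() {
    }

    /**
     * Returns the paths of all markers created before (now - maxAge), sorted by path.
     *
     * @param markers marker path mapped to its (assumed) creation time
     * @param now     the current time, passed in so the check is easy to test
     * @param maxAge  assumed upper bound for how long a legitimate pause may last
     */
    static List<String> findOrphaned(Map<String, Instant> markers, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        List<String> orphaned = new ArrayList<>();
        // TreeMap gives a deterministic, path-sorted iteration order.
        for (Map.Entry<String, Instant> entry : new TreeMap<>(markers).entrySet()) {
            if (entry.getValue().isBefore(cutoff)) {
                orphaned.add(entry.getKey());
            }
        }
        return orphaned;
    }
}
```

A startup routine could call `findOrphaned` and remove the returned nodes. In a cluster the age comparison would depend on reasonably synchronized clocks, which this sketch does not address.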

          I believe that this issue shows the limitations of the current implementation that is based on an unenforced convention between different components in the system (i.e. another service can block the JCR installer indefinitely).

          To move forward, there seems to be a consensus in offline discussions that the installer should provide an API to allow services to pause it.

          Having an API would then allow making the implementation more robust in a single place. Also, the responsibility for recovery (i.e. removing orphaned pause-markers) would reside with the component that is blocked, and thus allow it to self-heal.

          bdelacretaz Bertrand Delacretaz added a comment -

          If an API is created for this it can probably be made generic, in the end it's just a shared boolean flag with a time to live.
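As a rough illustration of "a boolean flag with a time to live", here is a minimal in-memory sketch. The class `TtlFlag` and its methods are hypothetical and not part of any Sling API; a real implementation would also have to share the state across the cluster (for example through the repository), which is left out here.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical flag that counts as set only until its time to live has elapsed. */
final class TtlFlag {

    // Point in time at which the flag expires; null means the flag is not set.
    private Instant expiresAt;

    /** Sets (or renews) the flag so that it stays set until now + ttl. */
    synchronized void set(Instant now, Duration ttl) {
        expiresAt = now.plus(ttl);
    }

    /** Clears the flag explicitly, as a well-behaved client would after finishing. */
    synchronized void clear() {
        expiresAt = null;
    }

    /** Returns true only while the flag is set and has not yet expired. */
    synchronized boolean isSet(Instant now) {
        return expiresAt != null && now.isBefore(expiresAt);
    }
}
```

With this shape, a client that dies without calling `clear()` blocks the installer for at most one time-to-live period, and a long-running client can keep the pause alive by calling `set` again before expiry.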

          bdelacretaz Bertrand Delacretaz added a comment -

          I have removed the fix version on this ticket as I'm about to release JCR Installer 3.1.18 for SLING-5371.

          olli Oliver Lietz added a comment -

          Setting priority to Critical, as orphaned pause-markers frequently break package installations (with bundles containing initial content) on AEM 6.1.

          amuthmann Alexander Muthmann added a comment -

          To whom it may concern: this was fixed for AEM 6.1 with SP2, released on 11 August (GRANITE-10726).

          kpolychr Kimon Polychroniadis added a comment -

          Is there a fix available for AEM 6.0?


            People

            • Assignee: Chetan Mehrotra (chetanm)
            • Reporter: Chetan Mehrotra (chetanm)
            • Votes: 0
            • Watchers: 8
