Uploaded image for project: 'Jackrabbit Oak'
  1. Jackrabbit Oak
  2. OAK-7709

DocumentNodeStore dispose aborts when store was disposed due to LeaseUpdate error

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.8.2
    • 1.10.0, 1.9.9, 1.8.10
    • documentmk
    • None
      • Linux
      • IBM JDK 8

    Description

      During a maintenance of our database system the DocumentNodeStore bundle shut down because of a lease timeout. When we restart the bundle, we ran into these exceptions:

      16.08.2018 11:32:44.139 *INFO* [Default Executor-thread-20544] org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService Configuring persistent cache to only cache nodes under paths [/libs, /apps, /jcr:system/jcr:nodeTypes, /jcr:system/rep:namespaces]
      16.08.2018 11:32:44.155 *INFO* [Default Executor-thread-20544] org.apache.jackrabbit.oak.plugins.document.persistentCache.PersistentCache start, url=/opt/aem/repository/sling/_cqa/cache,size=2048,binary=0
      16.08.2018 11:32:44.156 *WARN* [Default Executor-thread-20544] org.apache.jackrabbit.oak.plugins.document.persistentCache.MapFactory Error in the background thread of the persistent cache: java.lang.IllegalStateException: The file is locked: nio:/opt/aem/repository/sling/_cqa/cache/cache-0.data [1.4.194/7]
      16.08.2018 11:32:44.157 *WARN* [Default Executor-thread-20544] org.apache.jackrabbit.oak.plugins.document.persistentCache.MapFactory Could not open the store /opt/aem/repository/sling/_cqa/cache/cache-0.data
      java.lang.IllegalStateException: The file is locked: nio:/opta/aem/repository/sling/_cqa/cache/cache-0.data [1.4.194/7]
      at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:765)
      at org.h2.mvstore.FileStore.open(FileStore.java:168)
      at org.h2.mvstore.MVStore.<init>(MVStore.java:347)
      at org.h2.mvstore.MVStore$Builder.open(MVStore.java:2930)
      at org.apache.jackrabbit.oak.plugins.document.persistentCache.PersistentCache$1.openStore(PersistentCache.java:288)
      at org.apache.jackrabbit.oak.plugins.document.persistentCache.PersistentCache.createMapFactory(PersistentCache.java:361)
      at org.apache.jackrabbit.oak.plugins.document.persistentCache.PersistentCache.<init>(PersistentCache.java:210)
      at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBuilder.getPersistentCache(DocumentNodeStoreBuilder.java:648)
      at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBuilder.buildCache(DocumentNodeStoreBuilder.java:627)
      at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBuilder.buildDocumentCache(DocumentNodeStoreBuilder.java:586)
      at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBuilder.buildNodeDocumentCache(DocumentNodeStoreBuilder.java:594)
      at org.apache.jackrabbit.oak.plugins.document.rdb.RDBDocumentStore.initialize(RDBDocumentStore.java:967)
      

      It seems that the mvstore wasn't shut down properly during the stop of the DocumentNodeStore. For me it seems that the DocumentNodeStore doesn't use the persistent cache anymore until there was a "hard" restart (on the JVM).

      It would be good if even in this situation the shutdown of the DocumentNodeStore could drop the lock on the mvstore, so a clean restart of the bundle is possible.

      Attachments

        1. OAK-7709.diff
          5 kB
          Julian Reschke
        2. OAK-7709.diff
          4 kB
          Julian Reschke

        Issue Links

          Activity

            reschke Julian Reschke added a comment - - edited

            stefanegli, mreutegg - is restarting the bundle actually a supported way to get the system back up?

            reschke Julian Reschke added a comment - - edited stefanegli , mreutegg - is restarting the bundle actually a supported way to get the system back up?

            In my experience systems are usually restarted entirely (stop the process first) when the lease expires. But starting the bundle again should in my view also work if the bundles built on top of it behave correctly and restart as well when necessary.

            mreutegg Marcel Reutegger added a comment - In my experience systems are usually restarted entirely (stop the process first) when the lease expires. But starting the bundle again should in my view also work if the bundles built on top of it behave correctly and restart as well when necessary.
            reschke Julian Reschke added a comment -

            It would probably good to have test coverage for this then...

            reschke Julian Reschke added a comment - It would probably good to have test coverage for this then...
            stefanegli Stefan Egli added a comment -

            if the bundles built on top of it behave correctly

            that I think is the key and indeed might not be in 100% of the cases. Which is I guess why we rather recommend a restart

            stefanegli Stefan Egli added a comment - if the bundles built on top of it behave correctly that I think is the key and indeed might not be in 100% of the cases. Which is I guess why we rather recommend a restart
            reschke Julian Reschke added a comment -

            FWIW, I just looked at DocumentNodeStoreService, DocumentNodeStore, and PersistentCache, and they seem to attempt proper shutdown.

            reschke Julian Reschke added a comment - FWIW, I just looked at DocumentNodeStoreService, DocumentNodeStore, and PersistentCache, and they seem to attempt proper shutdown.
            reschke Julian Reschke added a comment -

            AFAIU, the root cause here is DocumentNodeStore.dispose()'s attempt to write a tombstone revision, which will fail with an exception upon lease failure, causing the remaining cleanup code to be skipped.

            I'll try to address this, but there may be more issues related to restart after lease failure down the road.

            reschke Julian Reschke added a comment - AFAIU, the root cause here is DocumentNodeStore.dispose()'s attempt to write a tombstone revision, which will fail with an exception upon lease failure, causing the remaining cleanup code to be skipped. I'll try to address this, but there may be more issues related to restart after lease failure down the road.
            reschke Julian Reschke added a comment -

            Proposed patch: OAK-7709.diff - this moves the tombstone write into the "not readonly" case (correct?), and does similar exception catching as the code which is already there.

            reschke Julian Reschke added a comment - Proposed patch: OAK-7709.diff - this moves the tombstone write into the "not readonly" case (correct?), and does similar exception catching as the code which is already there.

            That's correct. The tombstone is only necessary when the store is writable. A read-only DocumentNodeStore does not even perform lease updates and therefore cannot have an expired lease.

            I think the direction of the patch is good, but we should also have a test that reproduces the described problem.

            mreutegg Marcel Reutegger added a comment - That's correct. The tombstone is only necessary when the store is writable. A read-only DocumentNodeStore does not even perform lease updates and therefore cannot have an expired lease. I think the direction of the patch is good, but we should also have a test that reproduces the described problem.
            reschke Julian Reschke added a comment -

            Attached [^OAK-7709-test.diff] which enforces the lease update failure and checks, that the DocumentStore's dispose method was called.

            reschke Julian Reschke added a comment - Attached [^OAK-7709-test.diff] which enforces the lease update failure and checks, that the DocumentStore's dispose method was called.
            reschke Julian Reschke added a comment -

            Updated complete patch: OAK-7709.diff - mreutegg please review.

            reschke Julian Reschke added a comment - Updated complete patch: OAK-7709.diff - mreutegg please review.

            The patch re-orders closing the bundlingConfigHandler and creating the tombstone revision. What's the reason for this change? I think the bundlingConfigHandler should be kept open while other commits are potentially still in progress.

            mreutegg Marcel Reutegger added a comment - The patch re-orders closing the bundlingConfigHandler and creating the tombstone revision. What's the reason for this change? I think the bundlingConfigHandler should be kept open while other commits are potentially still in progress.
            reschke Julian Reschke added a comment -

            The intent was a simplification (moving the code fragment to a place which already handles exceptions). But I indeed didn't consider ordering.

            reschke Julian Reschke added a comment - The intent was a simplification (moving the code fragment to a place which already handles exceptions). But I indeed didn't consider ordering.
            reschke Julian Reschke added a comment -

            Updated patch in https://issues.apache.org/jira/secure/attachment/12938264/OAK-7709.diff - avoiding the reordering of operations

            reschke Julian Reschke added a comment - Updated patch in https://issues.apache.org/jira/secure/attachment/12938264/OAK-7709.diff - avoiding the reordering of operations

            LGTM

            mreutegg Marcel Reutegger added a comment - LGTM
            reschke Julian Reschke added a comment -

            ....just tried to backport to 1.8, merges cleanly, but test setup fails with:

            [ERROR] Failures:
            [ERROR]   DocumentNodeStoreTest.disposeAfterLeaseFailure:785 expected lease failure exception; test setup may be incorrect
            

            mreutegg - any idea why it's not throwing in 1.8? Use of clock? Different exception handling?

            reschke Julian Reschke added a comment - ....just tried to backport to 1.8, merges cleanly, but test setup fails with: [ERROR] Failures: [ERROR] DocumentNodeStoreTest.disposeAfterLeaseFailure:785 expected lease failure exception; test setup may be incorrect mreutegg - any idea why it's not throwing in 1.8? Use of clock? Different exception handling?

            I haven't looked at it in detail, but OAK-7626 crossed my mind. Oak 1.8 is not able to run in strict lease check mode. Maybe that's what the test requires?

            mreutegg Marcel Reutegger added a comment - I haven't looked at it in detail, but OAK-7626 crossed my mind. Oak 1.8 is not able to run in strict lease check mode. Maybe that's what the test requires?
            reschke Julian Reschke added a comment - - edited

            That seems to be the case; commenting out the "lenient" mode makes the test pass.

            Options:

            • commit the fix OAK-7709 without test case
            • backport OAK-7626 as well (with different default)
            • try to modify the test case so it actually triggers a lease failure
            reschke Julian Reschke added a comment - - edited That seems to be the case; commenting out the "lenient" mode makes the test pass. Options: commit the fix OAK-7709 without test case backport OAK-7626 as well (with different default) try to modify the test case so it actually triggers a lease failure
            reschke Julian Reschke added a comment -

            trunk: r1840462
            1.8: r1846721

            reschke Julian Reschke added a comment - trunk: r1840462 1.8: r1846721

            People

              reschke Julian Reschke
              joerghoh Joerg Hoh
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: