Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6.1, 2.0
    • Fix Version/s: 1.6.2, 2.1
    • Component/s: jackrabbit-core
    • Labels:
    • Environment:
      Weblogic 9.2

      Description

      In one of our client deployments on WebLogic 9.2 we observed JackRabbit sessions going stale in a load test. This was observed against release 1.6.1 (to which we migrated due to concurrency related issues JCR-2081 and JCR-2237). Same effect with 2.0.0.

      I could finally reproduce this issue locally. And it seems to boil down to WLS invoking the sequence of <prepare> ... <release> ... <commit> on one XA session from multiple threads, as it seems breaking assumptions of the thread-bound java.util.concurrent-RWLock based DefaultISMLocking class.
      Effectively the setActiveXid(..) method on DefaultISMLocking$RWLock fails as the old active XID was not yet cleared. With the result of more and more sessions deadlocking in below's invocation stack.

      "[ACTIVE] ExecuteThread: '27' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x33fc3ec0 nid=0x2324 in Object.wait() [0x2156a000..0x2156beb0] at java.lang.Object.wait(Native Method) - waiting on <0x68a54698> (a EDU.oswego.cs.dl.util.concurrent.WriterPreferenceReadWriteLock$WriterLock) at java.lang.Object.wait(Object.java:474) at EDU.oswego.cs.dl.util.concurrent.WriterPreferenceReadWriteLock$WriterLock.acquire(Unknown Source) - locked <0x68a54698> (a EDU.oswego.cs.dl.util.concurrent.WriterPreferenceReadWriteLock$WriterLock) at org.apache.jackrabbit.core.state.DefaultISMLocking$1.<init>(DefaultISMLocking.java:64) at org.apache.jackrabbit.core.state.DefaultISMLocking.acquireWriteLock(DefaultISMLocking.java:61) at org.apache.jackrabbit.core.version.AbstractVersionManager.acquireWriteLock(AbstractVersionManager.java:146) at org.apache.jackrabbit.core.version.XAVersionManager$1.prepare(XAVersionManager.java:562) at org.apache.jackrabbit.core.TransactionContext.prepare(TransactionContext.java:154) - locked <0x6dc2ad88> (a org.apache.jackrabbit.core.TransactionContext) at org.apache.jackrabbit.core.XASessionImpl.prepare(XASessionImpl.java:331) at org.apache.jackrabbit.jca.TransactionBoundXAResource.prepare(TransactionBoundXAResource.java:68) at weblogic.connector.security.layer.AdapterLayer.prepare(AdapterLayer.java:397) at weblogic.connector.transaction.outbound.XAWrapper.prepare(XAWrapper.java:297) at weblogic.transaction.internal.XAServerResourceInfo.prepare(XAServerResourceInfo.java:1276) at weblogic.transaction.internal.XAServerResourceInfo.prepare(XAServerResourceInfo.java:499) at weblogic.transaction.internal.ServerSCInfo$1.execute(ServerSCInfo.java:335) at weblogic.kernel.Kernel.executeIfIdle(Kernel.java:243) at weblogic.transaction.internal.ServerSCInfo.startPrepare(ServerSCInfo.java:326) at weblogic.transaction.internal.ServerTransactionImpl.localPrepare(ServerTransactionImpl.java:2516) at weblogic.transaction.internal.ServerTransactionImpl.globalPrepare(ServerTransactionImpl.java:2211) at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:266) at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227) at weblogic.transaction.internal.TransactionManagerImpl.commit(TransactionManagerImpl.java:283) at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1028) at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:709) at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:678)
      
      1. Xid.patch
        3 kB
        Claus Köll
      2. Xid_v4.patch
        12 kB
        Claus Köll
      3. Xid_v3.patch
        5 kB
        Claus Köll
      4. Xid_v2 deadlock.jpg
        62 kB
        Robert Sauer
      5. Xid_v2 + SynchronizedRef for markedXid.jpg
        63 kB
        Robert Sauer
      6. Xid_v2.patch
        5 kB
        Claus Köll
      7. LockManager.patch
        4 kB
        Claus Köll
      8. jr-core-xid-aware-ism-locking1.patch
        16 kB
        Robert Sauer
      9. ConcurrentReaders.java
        5 kB
        Robert Sauer

        Activity

        Hide
        Jukka Zitting added a comment -

        Merged to the 1.6 branch in revision 938462.

        Show
        Jukka Zitting added a comment - Merged to the 1.6 branch in revision 938462.
        Hide
        Claus Köll added a comment -

        Hi .. here is the patch ...

        greets
        claus

        Show
        Claus Köll added a comment - Hi .. here is the patch ... greets claus
        Hide
        Robert Sauer added a comment -

        Thanks, Claus. Can you provide a patch against 1.6.1 so I can additionally verify against our environment, too?

        Show
        Robert Sauer added a comment - Thanks, Claus. Can you provide a patch against 1.6.1 so I can additionally verify against our environment, too?
        Hide
        Claus Köll added a comment -

        Implemented a Xid aware ReentrantLock in LockManagerImpl and now no deadlocks have been occured anymore in our environment.

        Show
        Claus Köll added a comment - Implemented a Xid aware ReentrantLock in LockManagerImpl and now no deadlocks have been occured anymore in our environment.
        Hide
        Claus Köll added a comment -

        On the first look i think the problem is in the LockManagerImpl. With the last commit I have reverted the code from the internal lockMapLock (ReentrantLock).
        In that ReentrantLock there was a code with Xid aware but i have removed that because i thought that we do not need it anymore but in our
        environment it causes deadlocks. I will take a deeper look on that.
        greets

        Show
        Claus Köll added a comment - On the first look i think the problem is in the LockManagerImpl. With the last commit I have reverted the code from the internal lockMapLock (ReentrantLock). In that ReentrantLock there was a code with Xid aware but i have removed that because i thought that we do not need it anymore but in our environment it causes deadlocks. I will take a deeper look on that. greets
        Hide
        Robert Sauer added a comment -

        What's the background? I would like to extend my test driver to may be expose this issue locally, too.

        Thanks
        Robert

        Show
        Robert Sauer added a comment - What's the background? I would like to extend my test driver to may be expose this issue locally, too. Thanks Robert
        Hide
        Claus Köll added a comment -

        Found one more deadlock problem must reopen it ..

        Show
        Claus Köll added a comment - Found one more deadlock problem must reopen it ..
        Hide
        Claus Köll added a comment - - edited

        I have taken a look at JCR-1600 and JCR-447 where the code was introduced but i found no reason why this code was modified in that way to ignore the waitingWriters_ variable. I will change the code to work as the SuperClass should work.
        greets

        Show
        Claus Köll added a comment - - edited I have taken a look at JCR-1600 and JCR-447 where the code was introduced but i found no reason why this code was modified in that way to ignore the waitingWriters_ variable. I will change the code to work as the SuperClass should work. greets
        Hide
        Robert Sauer added a comment -

        My interpretation is that it implements the WriterPreference part of the WriterPreferenceReadWriteLock super class. Essentially allowing no new readers as long as there is someone waiting to write (see JavaDoc of WriterPreferenceReadWriteLock).
        I would suggest to stay as close as possible to the super implementation of overridden methods. It's part of both allowWriter() implementations for ReentrantWriterPreferenceReadWriteLock as well as WriterPreferenceReadWriteLock.

        Practical side effect should be that TX-finalization will tend to be quicker in a mixed read/write scenario as parallel readers will not be able to starve 2PC requests.

        Show
        Robert Sauer added a comment - My interpretation is that it implements the WriterPreference part of the WriterPreferenceReadWriteLock super class. Essentially allowing no new readers as long as there is someone waiting to write (see JavaDoc of WriterPreferenceReadWriteLock). I would suggest to stay as close as possible to the super implementation of overridden methods. It's part of both allowWriter() implementations for ReentrantWriterPreferenceReadWriteLock as well as WriterPreferenceReadWriteLock. Practical side effect should be that TX-finalization will tend to be quicker in a mixed read/write scenario as parallel readers will not be able to starve 2PC requests.
        Hide
        Claus Köll added a comment -

        Fixed in rev. 923277

        Show
        Claus Köll added a comment - Fixed in rev. 923277
        Hide
        Claus Köll added a comment -

        Sounds good that it works now

        I dont know if we should consult the waitingWriters_ because this code was introduced in the internal RWLock while ago .. and i don't know exactly why ?
        Yes you are right we will look how we migrate that code to the java1.5 concurrent implementation ...

        I will tag this issue to 1.6.2 so it will be in that release but i don't know the release date.
        Please feel free to post to the dev-mailing list so we can discuss release date there
        greets
        claus

        Show
        Claus Köll added a comment - Sounds good that it works now I dont know if we should consult the waitingWriters_ because this code was introduced in the internal RWLock while ago .. and i don't know exactly why ? Yes you are right we will look how we migrate that code to the java1.5 concurrent implementation ... I will tag this issue to 1.6.2 so it will be in that release but i don't know the release date. Please feel free to post to the dev-mailing list so we can discuss release date there greets claus
        Hide
        Robert Sauer added a comment -

        Thanks, Claus. The v4 patch seems to do the trick.

        One question: the allowReader() method in XidRWLock does not consult the waitingWriters_ member, contrary to the super implementation of allowReader(). I guess not doing so will loose the writer-preference property of the super class. Was this intended?
        I applied this change locally and still things look fine.

        Will this patch go into 1.6.2? Is there already a release date planned for 1.6.2? We're very eager to be able to put the fix into production.

        I agree that reusing edu-concurrent is probably be the safer way to implement this fix right now. However, in face of JCR-2089 I'm curious how this will translate eventually.

        Show
        Robert Sauer added a comment - Thanks, Claus. The v4 patch seems to do the trick. One question: the allowReader() method in XidRWLock does not consult the waitingWriters_ member, contrary to the super implementation of allowReader(). I guess not doing so will loose the writer-preference property of the super class. Was this intended? I applied this change locally and still things look fine. Will this patch go into 1.6.2? Is there already a release date planned for 1.6.2? We're very eager to be able to put the fix into production. I agree that reusing edu-concurrent is probably be the safer way to implement this fix right now. However, in face of JCR-2089 I'm curious how this will translate eventually.
        Hide
        Claus Köll added a comment -

        Implemented a Patch that checks in similar way as your patch if there is a Xid present and switches between 2 ReentrantReadWriteLocks.
        The different is that it is handled inside the DefaultISMLocking and uses the reentrant logic from used concurrent package.

        Show
        Claus Köll added a comment - Implemented a Patch that checks in similar way as your patch if there is a Xid present and switches between 2 ReentrantReadWriteLocks. The different is that it is handled inside the DefaultISMLocking and uses the reentrant logic from used concurrent package.
        Hide
        Claus Köll added a comment -

        I don't think that there is such scenario to handle both requests at one time.

        Show
        Claus Köll added a comment - I don't think that there is such scenario to handle both requests at one time.
        Hide
        Robert Sauer added a comment -

        Fully understood.

        Do you know of any scenario where one instance of ISMLocking might need to serve both XA as well as non-XA requests?

        Show
        Robert Sauer added a comment - Fully understood. Do you know of any scenario where one instance of ISMLocking might need to serve both XA as well as non-XA requests?
        Hide
        Claus Köll added a comment -

        I would like to have the XA specific locking based on Xids working in the DefaultISMLocking so that nobody must configure something special
        if you change your environment to XA.
        I don't know how it is still possible to acquire a writeLock and then the setActiveXid has a other globalTransactionId than the Xid first marked inside a synchronized block
        that creates a lock on the instance itself ?
        I will try to investigate more time to create a TestCase to get more informations about that race condition ..

        Show
        Claus Köll added a comment - I would like to have the XA specific locking based on Xids working in the DefaultISMLocking so that nobody must configure something special if you change your environment to XA. I don't know how it is still possible to acquire a writeLock and then the setActiveXid has a other globalTransactionId than the Xid first marked inside a synchronized block that creates a lock on the instance itself ? I will try to investigate more time to create a TestCase to get more informations about that race condition ..
        Hide
        Robert Sauer added a comment -

        Unfortunately still the same error.

        Would you mind if we try to work backwards from my known to be good patch and provide a variant that covers both scenarios with one class? I think the core change will be adjusting the XidLockInfo class towards a simple LockInfo that may either be associated with an XID or just a thread.
        Even if something like this would end up in an additional class (similar to FineGrainedISMLocking) that can be configured in repository.xml if someone likes that would be pretty fine with me.

        Show
        Robert Sauer added a comment - Unfortunately still the same error. Would you mind if we try to work backwards from my known to be good patch and provide a variant that covers both scenarios with one class? I think the core change will be adjusting the XidLockInfo class towards a simple LockInfo that may either be associated with an XID or just a thread. Even if something like this would end up in an additional class (similar to FineGrainedISMLocking) that can be configured in repository.xml if someone likes that would be pretty fine with me.
        Hide
        Claus Köll added a comment -

        Better to synchronize on the Object iteslf

        Show
        Claus Köll added a comment - Better to synchronize on the Object iteslf
        Hide
        Claus Köll added a comment -

        i think that the internal Xid check is now ok but the problem is the race condition as we see in your picture attachement. i don't know
        how you implemented the SynchronizedRef but i hope with a sync on a mutex should help now

        Show
        Claus Köll added a comment - i think that the internal Xid check is now ok but the problem is the race condition as we see in your picture attachement. i don't know how you implemented the SynchronizedRef but i hope with a sync on a mutex should help now
        Hide
        Claus Köll added a comment -

        A new patch that creates a lock on a mutex to prevent race conditions. i think we can use the DefaultISMLocking instance itself as mutex but
        it would be fine if you could test this patch first ..
        thanks
        claus

        Show
        Claus Köll added a comment - A new patch that creates a lock on a mutex to prevent race conditions. i think we can use the DefaultISMLocking instance itself as mutex but it would be fine if you could test this patch first .. thanks claus
        Hide
        Robert Sauer added a comment -

        Xid_v2 + SynchronizedRef for markedXid. Still the same error.

        Show
        Robert Sauer added a comment - Xid_v2 + SynchronizedRef for markedXid. Still the same error.
        Hide
        Robert Sauer added a comment -

        With Xid_v2 applied It's still hitting the error condition in DefaultISMLocking:136. Concurent threads are now deadlocking in DefaultISMLocking:67.

        Show
        Robert Sauer added a comment - With Xid_v2 applied It's still hitting the error condition in DefaultISMLocking:136. Concurent threads are now deadlocking in DefaultISMLocking:67.
        Hide
        Robert Sauer added a comment -

        About to test this. But I think there still is a race condition with markedXid. It should be a SynchronizedRef as it is written from an unguarded code block.

        Show
        Robert Sauer added a comment - About to test this. But I think there still is a race condition with markedXid. It should be a SynchronizedRef as it is written from an unguarded code block.
        Hide
        Claus Köll added a comment -

        Now at first the Xid's will be checked and then the default thread bound logic will work if we are not in a XAEnvironment.

        Show
        Claus Köll added a comment - Now at first the Xid's will be checked and then the default thread bound logic will work if we are not in a XAEnvironment.
        Hide
        Robert Sauer added a comment -

        Ok, I was just about to verify but will wait then.

        Admittedly your patch is pretty elegant due to the small change required.
        Still I have some reservation on having the underlying assumption of locks being bound to threads, and working around this assumption if an XID is present. Wonder how this carries over to java.util.concurrency eventually.

        Thanks
        Robert

        Show
        Robert Sauer added a comment - Ok, I was just about to verify but will wait then. Admittedly your patch is pretty elegant due to the small change required. Still I have some reservation on having the underlying assumption of locks being bound to threads, and working around this assumption if an XID is present. Wonder how this carries over to java.util.concurrency eventually. Thanks Robert
        Hide
        Claus Köll added a comment -

        Found one more problem .. i will investigate and let you know if it works
        greets

        Show
        Claus Köll added a comment - Found one more problem .. i will investigate and let you know if it works greets
        Hide
        Claus Köll added a comment -

        Can you please test this patch in your environment ?
        thanks

        Show
        Claus Köll added a comment - Can you please test this patch in your environment ? thanks
        Hide
        Claus Köll added a comment -

        You are absolutly right with your assumption.

        I have tried to create a TestCase but it is a little bit hard but i will try to create one
        The Problem is that the prefered locking strategy in the DefaultISMLocking is per Thread. I will
        write a patch that hanldes this a little bit easier than yours one without having two ISMLocking Instances.

        greets
        claus

        Show
        Claus Köll added a comment - You are absolutly right with your assumption. I have tried to create a TestCase but it is a little bit hard but i will try to create one The Problem is that the prefered locking strategy in the DefaultISMLocking is per Thread. I will write a patch that hanldes this a little bit easier than yours one without having two ISMLocking Instances. greets claus
        Hide
        Robert Sauer added a comment - - edited

        Essentially yes. I had breakpoint on all critical places, and this one was triggered.

        I assume the call sequences is as follows (having the JCA adapter driven by a WLS workmanager pool):
        1. thread-a invokes prepare on behalf of XID_1

        • the write lock is granted to thread-a
        • XID_1 is stored as the ActiveXID
          2. thread-a now invokes prepare on behalf of XID_2
        • tries to acquire the write lock, theoretically it should block until XID_1 was committed
        • but due to the previous association between the write lock and thread-a locking succeeds
        • thread-a now tries to set the ActiveXID to XID_2, but as XID_1 is still active, it will just trace a warning
          3. thread-b invokes commit on behalf of XID_1
        • the ActiveXID is cleared
        • the write lock gets released
          4. eventually some thread invokes commit on behalf of XID_2, but as the ActiveXID was not set, signalling readers fails

        I did not know how to easily fix 2. as the lock-to-thread association seems to be fundamentally wrong in this case. So I started with the different approach contained in the patch.

        Show
        Robert Sauer added a comment - - edited Essentially yes. I had breakpoint on all critical places, and this one was triggered. I assume the call sequences is as follows (having the JCA adapter driven by a WLS workmanager pool): 1. thread-a invokes prepare on behalf of XID_1 the write lock is granted to thread-a XID_1 is stored as the ActiveXID 2. thread-a now invokes prepare on behalf of XID_2 tries to acquire the write lock, theoretically it should block until XID_1 was committed but due to the previous association between the write lock and thread-a locking succeeds thread-a now tries to set the ActiveXID to XID_2, but as XID_1 is still active, it will just trace a warning 3. thread-b invokes commit on behalf of XID_1 the ActiveXID is cleared the write lock gets released 4. eventually some thread invokes commit on behalf of XID_2, but as the ActiveXID was not set, signalling readers fails I did not know how to easily fix 2. as the lock-to-thread association seems to be fundamentally wrong in this case. So I started with the different approach contained in the patch.
        Hide
        Claus Köll added a comment -

        Hi Robert,

        Dou you have seen the warn message "Unable to set the ActiveXid while a other one is associated with a different GloalTransactionId with this RWLock."
        with the original DefaultISMLocking ?

        greets
        claus

        Show
        Claus Köll added a comment - Hi Robert, Dou you have seen the warn message "Unable to set the ActiveXid while a other one is associated with a different GloalTransactionId with this RWLock." with the original DefaultISMLocking ? greets claus
        Hide
        Claus Köll added a comment -

        Hi Robert,

        Thank you for the Attachements .. i will take a look on it so i look forward to create a good patch against the deadlock

        greets
        claus

        Show
        Claus Köll added a comment - Hi Robert, Thank you for the Attachements .. i will take a look on it so i look forward to create a good patch against the deadlock greets claus
        Hide
        Robert Sauer added a comment -

        Attaching my test driver. This is currently using a proprietary service tier in front of the JCR session but still the intention should be clear.

        The getFolder call basically just looks for the appropriate nt:folder node via the given path, and extracts some core atributes as well as sub nodes.

        Where can I find an example on how to provide a plain JackRabbit test case that can be deployed to an app server?

        Show
        Robert Sauer added a comment - Attaching my test driver. This is currently using a proprietary service tier in front of the JCR session but still the intention should be clear. The getFolder call basically just looks for the appropriate nt:folder node via the given path, and extracts some core atributes as well as sub nodes. Where can I find an example on how to provide a plain JackRabbit test case that can be deployed to an app server?
        Hide
        Robert Sauer added a comment -

        I tried to be conservative and not replace DefaultISMLocking. Instead I just installed a special ISMLocking instance for AbstractVersionManager which uses the XID aware implementation if an XID is present or falls back to DefaultISMLocking otherwise.

        This patch was created against the 1.6.1 tag. Test for both jackrabbit-core and jackrabbit-jca are passing.

        The initial implementation was using java.util.concurrent APIs and my ad hoc tests indicate this had a somewhat higher througput than the oswego-concurrent based implementation contained in this patch (200 TX/s vs 165 TX/s).

        With this patch applied my test driver could repeatedly perform about 45.000 TXs from 45 concurrent threads without deadlocking connections.

        Show
        Robert Sauer added a comment - I tried to be conservative and not replace DefaultISMLocking. Instead I just installed a special ISMLocking instance for AbstractVersionManager which uses the XID aware implementation if an XID is present or falls back to DefaultISMLocking otherwise. This patch was created against the 1.6.1 tag. Test for both jackrabbit-core and jackrabbit-jca are passing. The initial implementation was using java.util.concurrent APIs and my ad hoc tests indicate this had a somewhat higher througput than the oswego-concurrent based implementation contained in this patch (200 TX/s vs 165 TX/s). With this patch applied my test driver could repeatedly perform about 45.000 TXs from 45 concurrent threads without deadlocking connections.
        Hide
        Claus Köll added a comment -

        Could you attach your Implementation of ISM ?

        Show
        Claus Köll added a comment - Could you attach your Implementation of ISM ?
        Hide
        Robert Sauer added a comment -

        The test driver is relatively simple as it just involves multiple threads reading from the repository in parallel sessions. Downside is just that I'm currently using an in house developed library on top of JCR which unfortuantely I cannot pass around.

        Working on an isolated test driver that talks just plain JCR, though.

        Show
        Robert Sauer added a comment - The test driver is relatively simple as it just involves multiple threads reading from the repository in parallel sessions. Downside is just that I'm currently using an in house developed library on top of JCR which unfortuantely I cannot pass around. Working on an isolated test driver that talks just plain JCR, though.

          People

          • Assignee:
            Claus Köll
            Reporter:
            Robert Sauer
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development