Issue Details (XML | Word | Printable)

Key: DERBY-3260
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Knut Anders Hatlen
Reporter: Robert Yokota
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Derby

NullPointerException caused by race condition in GenericActivationHolder

Created: 07/Dec/07 01:37 AM   Updated: 23/Jan/08 01:22 PM
Return to search
Component/s: SQL
Affects Version/s: 10.0.2.1, 10.1.1.0, 10.1.2.1, 10.1.3.1, 10.2.1.6, 10.2.2.0, 10.3.1.4
Fix Version/s: 10.3.3.0, 10.4.1.3

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works d3260.diff 2008-01-16 12:25 PM Knut Anders Hatlen 2 kB
File Licensed for inclusion in ASF works GenericActivationHolder.java.diff 2007-12-13 10:32 PM Robert Yokota 0.8 kB

Urgency: Blocker
Resolution Date: 23/Jan/08 01:22 PM


 Description  « Hide
I have a stress test using Derby 10.3.1.4 which is executing the same PreparedStatement using multiple threads concurrently and I consistently get the following NPE after several hours of running:

2007-12-07 00:48:10.914 GMT Thread[pool-5-thread-25,5,main] (XID = 1219661), (SESSIONID = 377), (DATABASE = /usr/ironhide/var/db/opera/derby), (DRDAID = null), Failed Statement is: select rdbmsvaria0_.GUID_AND_INDEX as GUID1_3_0_, rdbmsvaria0_.VALUE2 as VALUE2_3_0_, rdbmsvaria0_.HOLDER_GUID as HOLDER3_3_0_, rdbmsvaria0_.VALUE_TYPE as VALUE4_3_0_, rdbmsvaria0_.VALUE_GUID as VALUE5_3_0_, rdbmsvaria0_.DELETED as DELETED3_0_ from VARIABLE rdbmsvaria0_ where rdbmsvaria0_.GUID_AND_INDEX=? with 1 parameters begin parameter #1: 9C202AB9E8356288A9320C9C383A4D2F-11 :end parameter
java.lang.NullPointerException
        at org.apache.derby.impl.sql.GenericActivationHolder.execute(GenericActivationHolder.java:271)
        at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:368)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1203)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1652)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeQuery(EmbedPreparedStatement.java:275)
        at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:186)
        at org.hibernate.loader.Loader.getResultSet(Loader.java:1787)
        at org.hibernate.loader.Loader.doQuery(Loader.java:674)
        at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
        at org.hibernate.loader.Loader.loadEntity(Loader.java:1860)
        at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:48)
        at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:42)
        at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3044)
        at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:395)
        at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:375)
        at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:139)
        at org.hibernate.event.def.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:179)
        at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:103)
        at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:878)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:815)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:808)
        at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.hibernate.context.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:301)
        at $Proxy41.get(Unknown Source)
        at com.approuter.maestro.opera.rdbms.RdbmsContextHolder.getRdbmsVariable(RdbmsContextHolder.java:108)
        at com.approuter.maestro.opera.rdbms.RdbmsContextHolder.getVariable(RdbmsContextHolder.java:94)
        at com.approuter.maestro.vm.Frame.getParameter(Frame.java:218)
        at com.approuter.maestro.vm.Task.getParameter(Task.java:1267)
        at com.approuter.maestro.vm.CallContextImpl.setOutputParameter(CallContextImpl.java:195)
        at com.approuter.maestro.vm.CallContextImpl.getOutputParameterWriter(CallContextImpl.java:264)
        at com.approuter.maestro.sdk.mpi.DynamicExecutableActivity$3$1.getWriter(DynamicExecutableActivity.java:249)
        at com.approuter.module.xml.XmlSerializeTextActivity$XmlSerializeTextActivityInstance.serialize(XmlSerializeTextActivity.java:43)
        at sun.reflect.GeneratedMethodAccessor161.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.approuter.maestro.sdk.mpi.DynamicExecutableActivity.execute(DynamicExecutableActivity.java:332)
        at com.approuter.maestro.activities.Invoke.call(Invoke.java:199)
        at sun.reflect.GeneratedMethodAccessor157.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.approuter.maestro.activities.Instruction.call(Instruction.java:22)
        at com.approuter.maestro.vm.Program.call(Program.java:456)
        at com.approuter.maestro.vm.Task.run(Task.java:580)
        at com.approuter.maestro.vm.Task.run(Task.java:531)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
Cleanup action completed

Here is the relevant code from GenericActivationHolder.java

        // synchronized (ps)
        {
            /* Has the activation class changed? */
            if (gc != ps.getActivationClass())
            {

                // ensure the statement is valid by rePreparing it.
                ps.rePrepare(getLanguageConnectionContext()); // <-- line 260

                /*
                ** If we get here, it means the PreparedStatement has been
                ** recompiled. Get a new Activation and check whether the
                ** parameters are compatible. If so, transfer the parameters
                ** from the old Activation to the new one, and make that the
                ** current Activation. If not, throw an exception.
                */
                GeneratedClass newGC = ps.getActivationClass();

                BaseActivation newAC = (BaseActivation) newGC.newInstance(lcc); // <-- line 271

                DataTypeDescriptor[] newParamTypes = ps.getParameterTypes();


As you can see, the activation class returned by the ExecPreparedStatement is null as of line 271, even after having called a rePrepare(...) on the prepared statement (line 260). This is most likely due to a race condition. Also note that the containing block was formerly synchronized, but has since been commented out. From browsing the source in previous versions, the synchronized statement appears to have been commented out since before 10.0.2.1. It appears the fix is to reintroduce the synchronized block, or at least make the block inside the if statement synchronized.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Robert Yokota added a comment - 10/Dec/07 06:36 PM
I patched a custom build with the following change to the execute() method in GenericActivationHolder
and my stress test ran for over 24 hours without encountering an NPE, whereas before the change it
would get an NPE within about 9 hours.

The change is simply to synchronize on the prepared statement within the if statement.

    public ResultSet execute() throws StandardException
    {
        /*
        ** Synchronize to avoid problems if another thread is preparing
        ** the statement at the same time we're trying to execute it.
        */
        // synchronized (ps)
        {
            /* Has the activation class changed? */
            if (gc != ps.getActivationClass())
            {
              synchronized (ps) // <-- ADDED THIS LINE
              {
                // ensure the statement is valid by rePreparing it.
                ps.rePrepare(getLanguageConnectionContext());

                /*
                ** If we get here, it means the PreparedStatement has been
                ** recompiled. Get a new Activation and check whether the
                ** parameters are compatible. If so, transfer the parameters
                ** from the old Activation to the new one, and make that the
                ** current Activation. If not, throw an exception.
                */
                GeneratedClass newGC = ps.getActivationClass();

                BaseActivation newAC = (BaseActivation) newGC.newInstance(lcc);

                DataTypeDescriptor[] newParamTypes = ps.getParameterTypes();

                /*
                ** Link the new activation to the prepared statement.
                */
                newAC.setupActivation(ps, ac.getScrollable());

                newAC.setParameters(ac.getParameterValueSet(), paramTypes);


                /*
                ** IMPORTANT
                **
                ** Copy any essential state from the old activation
                ** to the new activation. This must match the state
                ** setup in EmbedStatement.
                ** singleExecution, cursorName, holdability, maxRows.
                */

                if (ac.isSingleExecution())
                    newAC.setSingleExecution();

                newAC.setCursorName(ac.getCursorName());

                newAC.setResultSetHoldability(ac.getResultSetHoldability());
                if (ac.getAutoGeneratedKeysResultsetMode()) //Need to do copy only if auto generated mode is on
                    newAC.setAutoGeneratedKeysResultsetInfo(ac.getAutoGeneratedKeysColumnIndexes(),
                    ac.getAutoGeneratedKeysColumnNames());
                newAC.setMaxRows(ac.getMaxRows());

                // break the link with the prepared statement
                ac.setupActivation(null, false);
                ac.close();

                /* Remember the new class information */
                ac = newAC;
                gc = newGC;
                paramTypes = newParamTypes;
              } // <-- ADDED THIS LINE
            }
        }

Thomas Nielsen added a comment - 10/Dec/07 06:51 PM
Just curious - do you see any performance drop from the added synchronization block, and if so how large?

Robert Yokota added a comment - 10/Dec/07 07:08 PM
I analyzed the numbers and there was no performance drop. This is most likely because the synchronized block inside the if statement is not happening that often.
It would probably be different if the synchronized statement occurred outside of the if statement, as before. I'm guessing the previous synchronized statement
was commented out for performance reasons. Unfortunately, it allows race conditions to occur now.

Knut Anders Hatlen added a comment - 10/Dec/07 07:31 PM
The discussion on DERBY-3024 may have some comments relevant to this issue. Do you have a repro that you can share with us? Does your stress test only execute a single statement, or is there some other background load (e.g., inserts, deletes, ddl)? I think it would be good if we could understand exactly what triggers the NPE. Then it's easier to see if this is the correct way to fix it.

Robert Yokota added a comment - 10/Dec/07 07:56 PM
Unfortunately, I don't have a test case to share with you. Yes, there is quite a bit of other activity (inserts, deletes, selects, etc).

However, looking at the code it is easy to see how such a race condition might occur. My observations:

1) The code is failing because ps.getActivationClass returns null after a ps.rePrepare
2) The only place that calls setActivationClass(null) on a ps is GenericStatement.prepMinion
3) GenericStatement.prepMinion is called from GenericStatement.prepare, which is called from GenericPreparedStatement.rePrepare, so it looks like it may be a race condition between two concurrent calls to GenericPreparedStatement.rePrepare.
4) If you look at other code in GenericPreparedStatement, you see the following, which indicates that it may be a good idea to synchronized on the ps when calling ps.rePrepare and then ps.getActivationClass immediately after.

    public Activation getActivation(LanguageConnectionContext lcc,
                                    boolean scrollable)
        throws StandardException
    {
        Activation ac;
        synchronized (this) {
            GeneratedClass gc = getActivationClass();

            if (gc == null) {
                rePrepare(lcc);
                gc = getActivationClass();
            }

            ac = new GenericActivationHolder(lcc, gc, this, scrollable);

            inUseCount++;
        }
        // DERBY-2689. Close unused activations-- this method should be called
        // when I'm not holding a lock on a prepared statement to avoid
        // deadlock.
        lcc.closeUnusedActivations();

        return ac;
    }

5) I'm not an expert on this code so the above is just conjecture :)

Thomas Nielsen added a comment - 13/Dec/07 02:54 PM
Robert,

Could you please post a proper patch with your proposed changes for others to test and review as well?
It would also be good to assign this issue to yourself to indicate that you are working on it.

Robert Yokota added a comment - 13/Dec/07 10:35 PM
I've attached a patch file. I'm not working on this any longer -- I've suggested the fix (just two lines) and my company has successfully run hundreds of tests with it. I would suggest someone else verify my analysis of the race condition and then commit the change.

Knut Anders Hatlen added a comment - 02/Jan/08 12:49 PM
The fix looks correct to me. We could probably make the synchronized block even smaller, since the last part of it only modifies the state of the activation, which is not shared between statements.

It would be interesting, though, if someone could shed some light on why the synchronization was commented out in the first place. This piece of code hasn't been changed since the code was donated to Apache, so the commit log doesn't tell us anything about it.

The synchronization statement that is commented out, enclosed the entire if statement, whereas the suggested fix only synchronizes the body of the if statement. I'm not sure, but I think this is fine, since the only risk is that we miss that someone else has recompiled/invalidated the plan, for which there are plenty of tests when we start executing the plan.

Knut Anders Hatlen added a comment - 16/Jan/08 12:25 PM
I think this patch (d3260.diff) which narrows down the synchronization block so that it only encloses the calls to rePrepare() and getActivationClass(), is sufficient to fix the NPE, but I can't tell for sure due to the lack of a reproducible test case.

Dyre Tjeldvoll added a comment - 17/Jan/08 03:58 PM
My take on this is that if your patch passes the tests (and there are no other ill effects), we should just commit it and then close the issue.
Should it re-surface we can always reopen, or create another issue.

Dyre Tjeldvoll added a comment - 17/Jan/08 03:58 PM
Since he created the latest patch

Knut Anders Hatlen added a comment - 21/Jan/08 09:49 AM
Committed revision 613815.

Will wait a few days before I back-port the fix to 10.3.

Knut Anders Hatlen added a comment - 23/Jan/08 01:22 PM
Merged to 10.3 (revision 614530)