Derby
  1. Derby
  2. DERBY-5894

NPE in OnlineBackupTest1 while backing up in stubFileToRemoveAfterCheckPoint

    Details

    • Urgency:
      Normal
    • Bug behavior facts:
      Regression Test Failure

      Description

      Saw this while running OnlineBackupTest1 trying to repro DERBY-973.

      ERROR 38000: The exception 'java.lang.NullPointerException' was thrown while evaluating an expression.
      at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:288)
      at org.apache.derby.iapi.error.StandardException.unexpectedUserException(StandardException.java:575)
      at org.apache.derby.impl.services.reflect.ReflectMethod.invoke(ReflectMethod.java:63)
      at org.apache.derby.impl.sql.execute.CallStatementResultSet.open(CallStatementResultSet.java:75)
      at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:443)
      at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:324)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1715)
      at org.apache.derby.impl.jdbc.EmbedCallableStatement.executeStatement(EmbedCallableStatement.java:118)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(EmbedPreparedStatement.java:1370)
      at org.apache.derbyTesting.functionTests.tests.store.OnlineBackup.performBackup(OnlineBackup.java:89)
      at org.apache.derbyTesting.functionTests.tests.store.OnlineBackup.run(OnlineBackup.java:60)
      at java.lang.Thread.run(Thread.java:637)
      Caused by: java.lang.NullPointerException
      at java.util.Hashtable.put(Hashtable.java:401)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.stubFileToRemoveAfterCheckPoint(BaseDataFileFactory.java:1613)
      at org.apache.derby.impl.store.raw.data.RAFContainer.run(RAFContainer.java:1651)
      at java.security.AccessController.doPrivileged(Native Method)
      at org.apache.derby.impl.store.raw.data.RAFContainer.backupContainer(RAFContainer.java:983)
      at org.apache.derby.impl.store.raw.data.BaseContainerHandle.backupContainer(BaseContainerHandle.java:1031)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.backupDataFiles(BaseDataFileFactory.java:2466)
      at org.apache.derby.impl.store.raw.RawStore.backup(RawStore.java:978)
      at org.apache.derby.impl.store.raw.RawStore.backup(RawStore.java:649)
      at org.apache.derby.impl.store.access.RAMAccessManager.backup(RAMAccessManager.java:964)
      at org.apache.derby.impl.db.BasicDatabase.backup(BasicDatabase.java:430)
      at org.apache.derby.catalog.SystemProcedures.SYSCS_BACKUP_DATABASE(SystemProcedures.java:961)
      at org.apache.derby.exe.acf33d40c7x0138x6d7cx6df3x0000720ad17b0.g0(Unknown Source)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:592)
      at org.apache.derby.impl.services.reflect.ReflectMethod.invoke(ReflectMethod.java:46)
      ... 10 more

      1. d5894-1b-minimal-privileges.diff
        10 kB
        Knut Anders Hatlen
      2. d5894-1a-minimal-change.diff
        3 kB
        Knut Anders Hatlen
      3. d5894.zip
        626 kB
        Dag H. Wanvik

        Activity

        Hide
        Dag H. Wanvik added a comment - - edited

        Attaching db files in d5894.zip (never mind the confusing top directory - cruft)

        Show
        Dag H. Wanvik added a comment - - edited Attaching db files in d5894.zip (never mind the confusing top directory - cruft)
        Hide
        Dag H. Wanvik added a comment -

        I checked the source for HashTable, and the NPE means the key is null, i.e. logInstant argument to stubFileToRemoveAfterCheckPoint.

        Show
        Dag H. Wanvik added a comment - I checked the source for HashTable, and the NPE means the key is null, i.e. logInstant argument to stubFileToRemoveAfterCheckPoint.
        Hide
        Mike Matrigali added a comment -

        triage for 10.10

        next step might be to run the test a number of times to get some idea how reproducible it
        is.

        Show
        Mike Matrigali added a comment - triage for 10.10 next step might be to run the test a number of times to get some idea how reproducible it is.
        Hide
        Knut Anders Hatlen added a comment -

        The stack trace says:

        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.stubFileToRemoveAfterCheckPoint(BaseDataFileFactory.java:1613)
        at org.apache.derby.impl.store.raw.data.RAFContainer.run(RAFContainer.java:1651)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.derby.impl.store.raw.data.RAFContainer.backupContainer(RAFContainer.java:983)

        However, backupContainer()'s call to doPrivileged() is supposed to end up calling privBackupContainer(), not stubFileToRemoveAfterCheckPoint(). I think this may happen if backupContainer() is called concurrently with one of the other RAFContainer methods that call doPrivileged(). Since backupContainer() is not synchronized, a call to one of the other RAFContainer methods (in the case reported here: RAFContainer.stubbify()) in another thread may proceed and change the value of actionCode while backupContainer() is executing. This makes the switch statement in RAFContainer.run() pick the wrong action, and if the execution of stubbify() has come so far as to resetting the actionInstant field to null before backupContainer() has completed, backupContainer() may fail with the reported NullPointerException.

        The fix would be to synchronize backupContainer(). All the other methods in RAFContainer that call doPrivileged(), are already synchronized.

        Show
        Knut Anders Hatlen added a comment - The stack trace says: at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.stubFileToRemoveAfterCheckPoint(BaseDataFileFactory.java:1613) at org.apache.derby.impl.store.raw.data.RAFContainer.run(RAFContainer.java:1651) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.store.raw.data.RAFContainer.backupContainer(RAFContainer.java:983) However, backupContainer()'s call to doPrivileged() is supposed to end up calling privBackupContainer(), not stubFileToRemoveAfterCheckPoint(). I think this may happen if backupContainer() is called concurrently with one of the other RAFContainer methods that call doPrivileged(). Since backupContainer() is not synchronized, a call to one of the other RAFContainer methods (in the case reported here: RAFContainer.stubbify()) in another thread may proceed and change the value of actionCode while backupContainer() is executing. This makes the switch statement in RAFContainer.run() pick the wrong action, and if the execution of stubbify() has come so far as to resetting the actionInstant field to null before backupContainer() has completed, backupContainer() may fail with the reported NullPointerException. The fix would be to synchronize backupContainer(). All the other methods in RAFContainer that call doPrivileged(), are already synchronized.
        Hide
        Knut Anders Hatlen added a comment -

        Synchronizing backupContainer() might fix this particular NPE, but it made some tests trip over an assert in RAFContainer4.readPage(long,byte[],long). The problem is that readPage() checks whether the call is synchronized, and behaves differently if it is. And backupContainer() will end up calling readPage(), but now synchronized instead of unsynchronized, and this was unexpected. See the description of "stealth mode" in DERBY-4741 for details.

        Synchronization in backupContainer() is only needed in order to protect the actionCode, actionContainerHandle and actionBackupLocation variables, and those variables are only needed because all of RAFContainer's privileged operations share one run() method, I think it's just as reasonable to keep it unsynchronized and instead let backupContainer() have its own run() method in an inner class. I even think that it's a cleaner approach in general, both because it avoids this class of bugs, and because it's more scalable as it avoids a global lock on the container for each privileged operation (not that relevant for backup, though, but still...).

        The tests that failed when RAFContainer.backupContainer() was made synchronized, were:

        lang.DatabaseClassLoadingTest
        lang.TruncateTableAndOnlineBackupTest

        derbyall/storeall/storeall.fail:store/rollForwardBackup.sql
        derbyall/storeall/storeall.fail:store/rollForwardRecovery.sql
        derbyall/storeall/storeall.fail:store/backupRestore.sql
        derbyall/storeall/storeall.fail:store/backupRestore1.java
        derbyall/storeall/storeall.fail:store/OnlineBackupTest1.java
        derbyall/storeall/storeall.fail:store/OnlineBackupTest3.java
        derbyall/storeall/storeall.fail:storetests/st_1.sql
        derbyall/encryptionAll/encryptionAll.fail:store/encryptDatabaseTest3.sql

        Show
        Knut Anders Hatlen added a comment - Synchronizing backupContainer() might fix this particular NPE, but it made some tests trip over an assert in RAFContainer4.readPage(long,byte[],long). The problem is that readPage() checks whether the call is synchronized, and behaves differently if it is. And backupContainer() will end up calling readPage(), but now synchronized instead of unsynchronized, and this was unexpected. See the description of "stealth mode" in DERBY-4741 for details. Synchronization in backupContainer() is only needed in order to protect the actionCode, actionContainerHandle and actionBackupLocation variables, and those variables are only needed because all of RAFContainer's privileged operations share one run() method, I think it's just as reasonable to keep it unsynchronized and instead let backupContainer() have its own run() method in an inner class. I even think that it's a cleaner approach in general, both because it avoids this class of bugs, and because it's more scalable as it avoids a global lock on the container for each privileged operation (not that relevant for backup, though, but still...). The tests that failed when RAFContainer.backupContainer() was made synchronized, were: lang.DatabaseClassLoadingTest lang.TruncateTableAndOnlineBackupTest derbyall/storeall/storeall.fail:store/rollForwardBackup.sql derbyall/storeall/storeall.fail:store/rollForwardRecovery.sql derbyall/storeall/storeall.fail:store/backupRestore.sql derbyall/storeall/storeall.fail:store/backupRestore1.java derbyall/storeall/storeall.fail:store/OnlineBackupTest1.java derbyall/storeall/storeall.fail:store/OnlineBackupTest3.java derbyall/storeall/storeall.fail:storetests/st_1.sql derbyall/encryptionAll/encryptionAll.fail:store/encryptDatabaseTest3.sql
        Hide
        Knut Anders Hatlen added a comment -

        I'm attaching two alternative fixes for this issue.

        d5894-1a-minimal-change.diff implements the minimal change to resolve the race condition. It makes backupContainer() have its own privileged action, so that there's no state shared with the other privileged operations, and hence no synchronization needed since there is no race.

        However, the current approach (not changed by the 1a patch) invokes the entire privBackupContainer() method in a privileged block. It's a relatively big method, and only some portions actually need to run with privileges.

        The alternative patch d5894-1b-minimal-privileges.diff changes backupContainer() so that it doesn't invoke all code in a privileged block. Instead, it uses helper methods with small privileged blocks that do as little as possible apart from invoking the operations that need privileges. In some cases existing helper methods could be used, but it also needed to add some new ones (for copying a file, removing a file, and creating a RandomAccessFile).

        Although the 1b patch adds more code than the 1a patch, I prefer the 1b patch, as it reduces the risk of accidentally giving permissions to code that's not supposed to have them. Both patches should solve the race condition, as they both ensure that backupContainer() does not use any shared state without synchronization. (I haven't reproduced the NPE, though, so I can't verify it.)

        Regression tests ran cleanly with both of the patches.

        Show
        Knut Anders Hatlen added a comment - I'm attaching two alternative fixes for this issue. d5894-1a-minimal-change.diff implements the minimal change to resolve the race condition. It makes backupContainer() have its own privileged action, so that there's no state shared with the other privileged operations, and hence no synchronization needed since there is no race. However, the current approach (not changed by the 1a patch) invokes the entire privBackupContainer() method in a privileged block. It's a relatively big method, and only some portions actually need to run with privileges. The alternative patch d5894-1b-minimal-privileges.diff changes backupContainer() so that it doesn't invoke all code in a privileged block. Instead, it uses helper methods with small privileged blocks that do as little as possible apart from invoking the operations that need privileges. In some cases existing helper methods could be used, but it also needed to add some new ones (for copying a file, removing a file, and creating a RandomAccessFile). Although the 1b patch adds more code than the 1a patch, I prefer the 1b patch, as it reduces the risk of accidentally giving permissions to code that's not supposed to have them. Both patches should solve the race condition, as they both ensure that backupContainer() does not use any shared state without synchronization. (I haven't reproduced the NPE, though, so I can't verify it.) Regression tests ran cleanly with both of the patches.
        Hide
        Knut Anders Hatlen added a comment -

        I've committed the 1b patch, revision 1422845.

        Show
        Knut Anders Hatlen added a comment - I've committed the 1b patch, revision 1422845.
        Hide
        Mamta A. Satoor added a comment -

        Temporarily assigning to me for the backport

        Show
        Mamta A. Satoor added a comment - Temporarily assigning to me for the backport
        Hide
        ASF subversion and git services added a comment -

        Commit 1504928 from Mamta A. Satoor in branch 'code/branches/10'
        [ https://svn.apache.org/r1504928 ]

        DERBY-5894(NPE in OnlineBackupTest1 while backing up in stubFileToRemoveAfterCheckPoint)

        Backporting to 10.9 Fix contributed by Knut.

        Show
        ASF subversion and git services added a comment - Commit 1504928 from Mamta A. Satoor in branch 'code/branches/10' [ https://svn.apache.org/r1504928 ] DERBY-5894 (NPE in OnlineBackupTest1 while backing up in stubFileToRemoveAfterCheckPoint) Backporting to 10.9 Fix contributed by Knut.
        Hide
        Mamta A. Satoor added a comment -

        I tried to backport this jira (it backported fine on 10.9) to 10.8, but it turning out not to be simple svn merge. svn merge was running into conflict so I hand made the changes in 10.8 code. But this changed code is relying on code changes that are not available in 10.8(those changes went in as part of a huge check in for DERBY-5363 and this jira has not made in 10.8. So, we will need to backport only partial changes from DERBY-5363 to 10.8). Additionally, that partial change from DERBY-5363 is using Java functionality that is not available with 10.8 JDK and hence will not compile in 10.8 without hand changing the code in quite a few places. I will go ahead and unassign myself from this jira.So far, it has been backported to 10.9. I don't plan on backporting to 10.8 right now.

        Show
        Mamta A. Satoor added a comment - I tried to backport this jira (it backported fine on 10.9) to 10.8, but it turning out not to be simple svn merge. svn merge was running into conflict so I hand made the changes in 10.8 code. But this changed code is relying on code changes that are not available in 10.8(those changes went in as part of a huge check in for DERBY-5363 and this jira has not made in 10.8. So, we will need to backport only partial changes from DERBY-5363 to 10.8). Additionally, that partial change from DERBY-5363 is using Java functionality that is not available with 10.8 JDK and hence will not compile in 10.8 without hand changing the code in quite a few places. I will go ahead and unassign myself from this jira.So far, it has been backported to 10.9. I don't plan on backporting to 10.8 right now.
        Hide
        Knut Anders Hatlen added a comment -

        [bulk update] Close all resolved issues that haven't been updated for more than one year.

        Show
        Knut Anders Hatlen added a comment - [bulk update] Close all resolved issues that haven't been updated for more than one year.

          People

          • Assignee:
            Knut Anders Hatlen
            Reporter:
            Dag H. Wanvik
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development