Derby
  1. Derby
  2. DERBY-3757

'ASSERT FAILED transaction table has null entry when running new StressMultiTest

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.5.1.1
    • Fix Version/s: 10.5.3.1, 10.6.1.0
    • Component/s: Store
    • Labels:
      None
    • Environment:
    • Urgency:
      Normal
    • Issue & fix info:
      High Value Fix

      Description

      When trying the DERBY-1764-V2.diff patch of DERBY-1764, I got this assertion running the test. It appears to be a bug iin Derby.

      1) testStressMulti(org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest)java.sql.SQLException: Java exception: 'ASSERT FAILED transaction table has null entry: org.apache.derby.shared.common.sanity.AssertFailure'.
      at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
      at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:87)
      at org.apache.derby.impl.jdbc.Util.javaException(Util.java:244)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:403)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346)
      at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2183)
      at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1325)
      at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:625)
      at <unknown class>.<unknown method>(Unknown Source)
      at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.run(StressMultiTest.jav
      a:317)
      at java.lang.Thread.run(Thread.java:803)
      Caused by: org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED transaction table has null entry
      at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:120)
      at org.apache.derby.impl.store.raw.xact.TransactionTable.getTransactionInfo(TransactionTable.java:968)
      at org.apache.derby.impl.store.raw.xact.XactFactory.getTransactionInfo(XactFactory.java:991)
      at org.apache.derby.impl.store.raw.RawStore.getTransactionInfo(RawStore.java:1153)
      at org.apache.derby.impl.store.access.RAMAccessManager.getTransactionInfo(RAMAccessManager.java:912)
      at org.apache.derby.impl.services.locks.Deadlock.buildException(Deadlock.java:266)
      at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(ConcurrentLockSet.java:613)
      at org.apache.derby.impl.services.locks.AbstractPool.lockObject(AbstractPool.java:117)
      at org.apache.derby.impl.store.raw.xact.RowLocking3.lockRecordForWrite(RowLocking3.java:248)
      at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:504)
      at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638)
      at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:335)
      at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:628)
      at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:112)
      at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:304)
      at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1809)
      at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:680)
      at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(IndexRowToBaseRowResultSet.java:3
      73)
      at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:255)
      at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:186)
      at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
      at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:424)
      at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:246)
      at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:384)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1235)
      at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:625)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(EmbedStatement.java:175)
      at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.update(StressMultiTest.
      java:471)
      ... 2 more

      FAILURES!!!
      Tests run: 3, Failures: 0, Errors: 1

      1. Derby-3757_1.diff
        0.7 kB
        Erlend Birkenes
      2. testStressMulti.tar.gz
        2.07 MB
        Tiago R. Espinha
      3. sync_trans.diff
        1.0 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Kathey Marsden added a comment -

          This issue was discovered during the conversion of stress.multi (DERBY-1764)

          Show
          Kathey Marsden added a comment - This issue was discovered during the conversion of stress.multi ( DERBY-1764 )
          Hide
          Mike Matrigali added a comment -

          I don't think this code is doing the right kind of synchronization, but given that it is for diagnosics only I am not sure it is worth adding additional synchronization. Would appreciate another set of eyes to try and figure out how the null entry comes about.

          The code in public TransactionInfo[] getTransactionInfo(), synchronizes on "this"
          synchronized(this)
          {
          int ntran = trans.size();
          tinfo = new TransactionTableEntry[ntran];

          LogInstant logInstant = null;
          int i = 0;

          for (Enumeration e = trans.elements();
          e.hasMoreElements(); )

          { TransactionTableEntry ent = (TransactionTableEntry)e.nextElement(); if (ent != null) tinfo[i++] = (TransactionTableEntry)ent.clone(); if (SanityManager.DEBUG) SanityManager.ASSERT(ent != null, "transaction table has nul "); }

          }

          But the rest of the routine, like add and remove element seems to count on the automatic synchronization of the
          HashTable which is a field of this class. So could a concurrent remove or add to the hash table affect the
          middle of looping through the Enumeration and get a null?

          I would lean toward just getting rid of this assert for this routine which is used only for diagnostics like vti and error messages.

          Show
          Mike Matrigali added a comment - I don't think this code is doing the right kind of synchronization, but given that it is for diagnosics only I am not sure it is worth adding additional synchronization. Would appreciate another set of eyes to try and figure out how the null entry comes about. The code in public TransactionInfo[] getTransactionInfo(), synchronizes on "this" synchronized(this) { int ntran = trans.size(); tinfo = new TransactionTableEntry [ntran] ; LogInstant logInstant = null; int i = 0; for (Enumeration e = trans.elements(); e.hasMoreElements(); ) { TransactionTableEntry ent = (TransactionTableEntry)e.nextElement(); if (ent != null) tinfo[i++] = (TransactionTableEntry)ent.clone(); if (SanityManager.DEBUG) SanityManager.ASSERT(ent != null, "transaction table has nul "); } } But the rest of the routine, like add and remove element seems to count on the automatic synchronization of the HashTable which is a field of this class. So could a concurrent remove or add to the hash table affect the middle of looping through the Enumeration and get a null? I would lean toward just getting rid of this assert for this routine which is used only for diagnostics like vti and error messages.
          Hide
          Knut Anders Hatlen added a comment -

          It looks to me as if the rest of the class actually uses synchronization on "this" when entries are added to or removed from the hash table. The exception is if it happens during recovery (which is single-threaded and therefore OK), but it also looks like there's a code path from Xact.close() -> XactFactory.remove() -> Xact.remove() -> Hashtable.remove() that is not synchronized on "this" and which is likely causing the assert failure by removing an entry while another thread is iterating over the table.

          Although this failure is seen in debug code, I think it indicates that the synchronization scheme used in TransactionTable is flawed, so I would feel more comfortable if we fixed the synchronization. I have tried to understand how the synchronization in this class is supposed to work before, but with no success.

          What I have problem understanding, is when to synchronize on which object. The class javadoc says that the class depends on Hashtable synchronization, but says nothing about synchronization on "this". Some of the methods explicitly synchronize on "this", others on "trans", some rely on the implicit synchronization from the Hashtable class, and some have a synchronized (this) block enclosing a synchronized (trans) block. Some of the code even says that we need synchronization on "this" here without synchronizing on "this".

          Show
          Knut Anders Hatlen added a comment - It looks to me as if the rest of the class actually uses synchronization on "this" when entries are added to or removed from the hash table. The exception is if it happens during recovery (which is single-threaded and therefore OK), but it also looks like there's a code path from Xact.close() -> XactFactory.remove() -> Xact.remove() -> Hashtable.remove() that is not synchronized on "this" and which is likely causing the assert failure by removing an entry while another thread is iterating over the table. Although this failure is seen in debug code, I think it indicates that the synchronization scheme used in TransactionTable is flawed, so I would feel more comfortable if we fixed the synchronization. I have tried to understand how the synchronization in this class is supposed to work before, but with no success. What I have problem understanding, is when to synchronize on which object. The class javadoc says that the class depends on Hashtable synchronization, but says nothing about synchronization on "this". Some of the methods explicitly synchronize on "this", others on "trans", some rely on the implicit synchronization from the Hashtable class, and some have a synchronized (this) block enclosing a synchronized (trans) block. Some of the code even says that we need synchronization on "this" here without synchronizing on "this".
          Hide
          Erlend Birkenes added a comment -

          This patch just comments out the assert in org.apache.derby.impl.store.raw.xact.TransactionTable.

          Show
          Erlend Birkenes added a comment - This patch just comments out the assert in org.apache.derby.impl.store.raw.xact.TransactionTable.
          Hide
          Tiago R. Espinha added a comment -

          I have hit this bug when doing changes for DERBY-4217. It doesn't seem to be connected, but the failure has happened at a different place. The underlying error in derby.log seems to be the same though.

          Here's the stack trace for the failure:
          -----------8<---------------------
          junit.framework.AssertionFailedError: Caused by:
          java.sql.SQLNonTransientConnectionException: Network protocol exception: DSS length not 0 at end of same id chain parse. The connection has been terminated.
          at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70)
          at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358)
          at org.apache.derby.client.am.Statement.executeQuery(Statement.java:483)
          at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.select(StressMultiTest.java:535)
          at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.run(StressMultiTest.java:409)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: org.apache.derby.client.am.DisconnectException: Network protocol exception: DSS length not 0 at end of same id chain parse. The connection has been terminated.
          at org.apache.derby.client.net.Reply.endOfSameIdChainData(Reply.java:1174)
          at org.apache.derby.client.net.NetStatementReply.readOpenQuery(NetStatementReply.java:66)
          at org.apache.derby.client.net.StatementReply.readOpenQuery(StatementReply.java:50)
          at org.apache.derby.client.net.NetStatement.readOpenQuery_(NetStatement.java:156)
          at org.apache.derby.client.am.Statement.readOpenQuery(Statement.java:1478)
          at org.apache.derby.client.am.Statement.flowExecute(Statement.java:2095)
          at org.apache.derby.client.am.Statement.executeQueryX(Statement.java:489)
          at org.apache.derby.client.am.Statement.executeQuery(Statement.java:474)
          ... 3 more

          at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest.handleException(StressMultiTest.java:331)
          at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest.access$200(StressMultiTest.java:70)
          at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.run(StressMultiTest.java:425)
          at java.lang.Thread.run(Thread.java:619)

          Show
          Tiago R. Espinha added a comment - I have hit this bug when doing changes for DERBY-4217 . It doesn't seem to be connected, but the failure has happened at a different place. The underlying error in derby.log seems to be the same though. Here's the stack trace for the failure: ----------- 8< --------------------- junit.framework.AssertionFailedError: Caused by: java.sql.SQLNonTransientConnectionException: Network protocol exception: DSS length not 0 at end of same id chain parse. The connection has been terminated. at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70) at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358) at org.apache.derby.client.am.Statement.executeQuery(Statement.java:483) at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.select(StressMultiTest.java:535) at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.run(StressMultiTest.java:409) at java.lang.Thread.run(Thread.java:619) Caused by: org.apache.derby.client.am.DisconnectException: Network protocol exception: DSS length not 0 at end of same id chain parse. The connection has been terminated. at org.apache.derby.client.net.Reply.endOfSameIdChainData(Reply.java:1174) at org.apache.derby.client.net.NetStatementReply.readOpenQuery(NetStatementReply.java:66) at org.apache.derby.client.net.StatementReply.readOpenQuery(StatementReply.java:50) at org.apache.derby.client.net.NetStatement.readOpenQuery_(NetStatement.java:156) at org.apache.derby.client.am.Statement.readOpenQuery(Statement.java:1478) at org.apache.derby.client.am.Statement.flowExecute(Statement.java:2095) at org.apache.derby.client.am.Statement.executeQueryX(Statement.java:489) at org.apache.derby.client.am.Statement.executeQuery(Statement.java:474) ... 3 more at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest.handleException(StressMultiTest.java:331) at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest.access$200(StressMultiTest.java:70) at org.apache.derbyTesting.functionTests.tests.multi.StressMultiTest$StressMultiRunnable.run(StressMultiTest.java:425) at java.lang.Thread.run(Thread.java:619)
          Hide
          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2. As far as I understand, the database does not need to rebooted after this failure, so unchecking crash.

          Show
          Knut Anders Hatlen added a comment - Triaged for 10.5.2. As far as I understand, the database does not need to rebooted after this failure, so unchecking crash.
          Hide
          Myrna van Lunteren added a comment -

          I got this stack running StressMultiTest pretty consistently after applying patch diff8 for DERBY-1465 to trunk updated to revision 827796, with ibm1.5 and sane jars.

          Show
          Myrna van Lunteren added a comment - I got this stack running StressMultiTest pretty consistently after applying patch diff8 for DERBY-1465 to trunk updated to revision 827796, with ibm1.5 and sane jars.
          Hide
          Knut Anders Hatlen added a comment -

          I've looked more at the synchronization in the TransactionTable class (see comment on DERBY-3092), and I believe that getTransactionInfo() should synchronize on "trans" instead of "this". Synchronization on "this" is used to prevent transaction table entries from changing status to update transactions, which is not important for that method. Instead, we should synchronize on "trans" to prevent entries from being removed between the calls to hasMoreElements() and nextElement().

          As to Mike's question about how nextElement() could end up returning null, I think the key is that the Enumeration object returned by Hashtable.elements() is not thread-safe. Although all of Hashtable's public methods are synchronized, none of the methods of the returned Enumeration object are synchronized. This means that there's nothing preventing Hashtable.remove() from being called and executed in one thread while another thread is executing nextElement().

          I looked at OpenJDK's implementation of Hashtable, and it looks to me that executing remove() while nextElement() is executing, depending on the exact timing, may result in nextElement() returning null or throwing NoSuchElementException. (The manifestation as a NoSuchElementException is logged as DERBY-3916.) If we synchronize on the Hashtable object, we would prevent both of these problems.

          Show
          Knut Anders Hatlen added a comment - I've looked more at the synchronization in the TransactionTable class (see comment on DERBY-3092 ), and I believe that getTransactionInfo() should synchronize on "trans" instead of "this". Synchronization on "this" is used to prevent transaction table entries from changing status to update transactions, which is not important for that method. Instead, we should synchronize on "trans" to prevent entries from being removed between the calls to hasMoreElements() and nextElement(). As to Mike's question about how nextElement() could end up returning null, I think the key is that the Enumeration object returned by Hashtable.elements() is not thread-safe. Although all of Hashtable's public methods are synchronized, none of the methods of the returned Enumeration object are synchronized. This means that there's nothing preventing Hashtable.remove() from being called and executed in one thread while another thread is executing nextElement(). I looked at OpenJDK's implementation of Hashtable, and it looks to me that executing remove() while nextElement() is executing, depending on the exact timing, may result in nextElement() returning null or throwing NoSuchElementException. (The manifestation as a NoSuchElementException is logged as DERBY-3916 .) If we synchronize on the Hashtable object, we would prevent both of these problems.
          Hide
          Knut Anders Hatlen added a comment -

          Here's a patch that changes the synchronization in the method and adds a comment about why. I believe this patch fixes this issue and DERBY-3916, but I haven't run any tests yet.

          Show
          Knut Anders Hatlen added a comment - Here's a patch that changes the synchronization in the method and adds a comment about why. I believe this patch fixes this issue and DERBY-3916 , but I haven't run any tests yet.
          Hide
          Knut Anders Hatlen added a comment -

          Regression tests ran cleanly with the patch.

          Show
          Knut Anders Hatlen added a comment - Regression tests ran cleanly with the patch.
          Hide
          Knut Anders Hatlen added a comment -

          Committed revision 892912. If the fix doesn't cause any problems, and StressMultiTest does not run into this issue anymore, we should probably back-port it to 10.5 to make StressMultiTest run cleanly on that branch too. (StressMultiTest was introduced in 10.5.)

          Show
          Knut Anders Hatlen added a comment - Committed revision 892912. If the fix doesn't cause any problems, and StressMultiTest does not run into this issue anymore, we should probably back-port it to 10.5 to make StressMultiTest run cleanly on that branch too. (StressMultiTest was introduced in 10.5.)
          Hide
          Knut Anders Hatlen added a comment -

          Merged fix to 10.5 and committed revision 895609.

          Show
          Knut Anders Hatlen added a comment - Merged fix to 10.5 and committed revision 895609.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Kathey Marsden
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development