DERBY-5603: EmbedConnection.clearLOBMapping() incorrectly clears lobFiles causing a ConcurrentModificationException

    Details

    • Urgency:
      Urgent
    • Bug behavior facts:
      Crash

      Description

      In EmbedConnection.clearLOBMapping(), the code that iterates over lobFiles clears the Set in a finally block inside the loop. This causes a ConcurrentModificationException to be thrown, and even using a concurrent data structure would still result in only one LOBFile being correctly closed.

      This will occur anytime the lobFiles Set contains more than 1 LOBFile.
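
      For illustration, a minimal standalone sketch (not the actual Derby source) of the failing pattern: a HashSet that is cleared from a finally block inside the loop iterating over it. With two or more entries, the second call to the iterator throws a ConcurrentModificationException, and only the first element is ever processed.

          import java.util.HashSet;
          import java.util.Iterator;
          import java.util.Set;

          public class ClearInsideLoop {
              public static void main(String[] args) {
                  Set<String> lobFiles = new HashSet<String>();
                  lobFiles.add("lobFile1");
                  lobFiles.add("lobFile2");

                  Iterator<String> it = lobFiles.iterator();
                  while (it.hasNext()) {
                      try {
                          // Fails on the second iteration: clear() below already
                          // modified the set while this iterator was still live.
                          String file = it.next();
                          System.out.println("closing " + file);
                      } finally {
                          lobFiles.clear();   // structural modification mid-iteration
                      }
                  }
              }
          }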

      Stack Trace:
      java.sql.SQLException: Java exception: ': java.util.ConcurrentModificationException'.
      at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.Util.javaException(Unknown Source)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedConnection.commit(Unknown Source)
      <lines removed>
      Caused by: java.sql.SQLException: Java exception: ': java.util.ConcurrentModificationException'.
      at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
      at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
      ... 16 more
      Caused by: java.util.ConcurrentModificationException
      at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
      at java.util.HashMap$KeyIterator.next(Unknown Source)
      at org.apache.derby.impl.jdbc.EmbedConnection.clearLOBMapping(Unknown Source)
      ... 10 more

          Activity

          Kristian Waagan added a comment -

          Attaching patch 1a, which addresses the problem reported.
          The simplest fix is to just move the finally-block to an outer try-block. I changed a little more:
          o don't call clear in a finally block (consistent with clearing other maps, not sure if this is the best choice)
          o try to close all file handles, even if an IOException is encountered (the first exception is saved and thrown later, subsequent exceptions are discarded)
          o made getLobHMObj private
          o adjusted comment, since I can't see that the code resets the locator value or nullifies the map reference.
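
          A rough sketch of the restructured cleanup described above, using hypothetical names (LobCleanupSketch, closeAll) rather than the committed code: close every handle, keep only the first IOException, and clear the set after the iteration has finished.

              import java.io.Closeable;
              import java.io.IOException;
              import java.util.Set;

              class LobCleanupSketch {
                  // Closes all files, remembering the first failure and discarding
                  // later ones; the set is cleared only after iteration completes.
                  static void closeAll(Set<? extends Closeable> lobFiles) throws IOException {
                      IOException firstException = null;
                      for (Closeable lobFile : lobFiles) {
                          try {
                              lobFile.close();
                          } catch (IOException ioe) {
                              if (firstException == null) {
                                  firstException = ioe;
                              }
                          }
                      }
                      lobFiles.clear();
                      if (firstException != null) {
                          throw firstException;
                      }
                  }
              }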

          I'm having trouble writing a test for this issue. In my attempts the cleanup has been proper, so there is nothing left to do when the code reaches the point where the lobFiles-map is cleared. I don't see the point in trying to write a test where the VM is pushed close to its limits to force the WeakHashMap-entry to be removed and the finalization to be delayed, if that's what it takes to trigger this bug.

          Jon, do you have a simple repro for this issue?

          I'm running the full regression suite now.

          Note: The patch is generated with git, use -p1 when applying with patch.

          Jon Steinich added a comment -

          I haven't verified a simple repro case, but pre-patch it should have failed whenever there was more than one entry in the lobFiles Set, so I believe creating 2 blobs in a single transaction would cause it.
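
          For what it's worth, a hypothetical sketch of that kind of repro (two Blobs created in one transaction, then commit). As the following comment explains, the lobFiles entries are normally cleaned up before commit, so this is not expected to trigger the error reliably.

              import java.sql.Blob;
              import java.sql.Connection;
              import java.sql.DriverManager;

              public class TwoBlobRepro {
                  public static void main(String[] args) throws Exception {
                      Connection conn = DriverManager.getConnection(
                              "jdbc:derby:memory:reproDb;create=true");
                      conn.setAutoCommit(false);

                      // Sizes chosen arbitrarily, hoping to push each LOB into a
                      // temporary backing file rather than keeping it in memory.
                      Blob b1 = conn.createBlob();
                      b1.setBytes(1, new byte[64 * 1024]);
                      Blob b2 = conn.createBlob();
                      b2.setBytes(1, new byte[64 * 1024]);

                      conn.commit();   // clearLOBMapping() runs during commit
                      conn.close();
                  }
              }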

          Kristian Waagan added a comment -

          The "problem" is that the entries are removed before Derby reaches the code that fails. I can easily provoke the error if I modify the source code.
          There's a WeakHashMap and a finalizer involved, which makes things a bit tricky - at least on my system.

          Are you seeing the issue during high load, or do you know if memory is scarce in the VM when this happens?
          Are you simply creating new LOBs, or are you both growing and shrinking them in the same transaction? There may be a specific access pattern that triggers the bug.

          I see you mention Blob again. I tried to reproduce with Clob. Although this code is shared between the two, there may be a bug in the Blob-implementation that triggers the issue. I'll try again with a Blob-repro.

          Jon Steinich added a comment -

          Ah, I didn't realize they were generally removed prior to that point. Now something that seemed irrelevant before is likely the key...
          We have 2 connections open to the database; one for reading and one for writing. The read connection is probably reading in a blob (creating one) while the commit is clearing on the write connection.

          It wasn't under high load or low memory when the error occurred. We always call createBlob() and then write to its output stream.

          Kristian Waagan added a comment -

          Thank you for the feedback.

          It's hard to say exactly what's going on without a repro, but:
          o the maps are private to each connection
          o the code you pointed at is obviously broken, and will fail if a map with more than one entry is cleared
          o access to the map is synchronized on the connection object

          Currently I can only think of the situation where the entries haven't been cleared the "normal" way for whatever reason, and then the clearing fails due to the broken code.
          I'll wait for more input from the community, and then commit a patch along the lines of the current suggestion/patch if nothing substantial comes up.

          Finally, how often do you see this bug?
          Are you running with isolation level read-uncommitted? (may not be relevant, but there are some issues with LOBs and locking at that level)
          Are you in a situation where you could test a build off the head of the 10.8 branch?
          That is, 10.8.2.2 + all fixes that have been backported since the release. Building the jars with 10.8.2.2 + this fix only is also possible of course. None of these have been tested thoroughly by the community.

          Jon Steinich added a comment -

          It isn't that common. I'd say maybe once every 500 transactions (a bit over an hour). We are still doing testing at the moment so we haven't put a consistent load on it.

          The write connection is using TRANSACTION_READ_COMMITTED and the read connection is using TRANSACTION_READ_UNCOMMITTED.

          I should be able to test a build from head. I haven't built from source before, but it looked to be fairly straightforward.

          Jon Steinich added a comment -

          I have started using the head build + this fix. I will report back later with any findings.

          Kristian Waagan added a comment -

          Great, thanks for testing the fix

          Kristian Waagan added a comment -

          Jon, any updates on the fix?
          I plan to commit it shortly.

          Jon Steinich added a comment -

          We haven't had any issues since switching to that build.
          Thanks

          Kristian Waagan added a comment -

          Committed patch 1a to trunk with revision 1294512. I plan to backport this fix.

          Kristian Waagan added a comment -

          Thank you for testing the fix and reporting back, Jon

          Kristian Waagan added a comment -

          I plan to backport this all the way back. I have run suites.All on 10.3, 10.6, 10.7 and 10.8.

          Kristian Waagan added a comment -

          Backported fix from trunk with the following revisions:
          o 10.8: 1298735
          o 10.7: 1298740
          o 10.6: 1298743
          o 10.5: 1298750
          o 10.4: 1298751
          o 10.3: 1298752

          Resolving issue.

          Knut Anders Hatlen added a comment -

          [bulk update: close all resolved issues that haven't had any activity the last year]


            People

            • Assignee:
              Kristian Waagan
            • Reporter:
              Jon Steinich
            • Votes:
              0
            • Watchers:
              1

                Time Tracking

                • Estimated: 1h
                • Remaining: 1h
                • Logged: Not Specified
