Derby / DERBY-5312

InterruptResilienceTest failed with ERROR 40XD1: Container was opened in read-only mode.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.9.1.0
    • Fix Version/s: 10.8.2.2, 10.9.1.0
    • Component/s: Store
    • Labels:
      None
    • Environment:
      Oracle Solaris 11 Express snv_151a X86
      java version "1.7.0"
      Java(TM) SE Runtime Environment (build 1.7.0-b147)
      Java HotSpot(TM) Server VM (build 21.0-b17, mixed mode)
    • Bug behavior facts:
      Regression Test Failure

      Description

      WorkerThread failed with this exception:

      ERROR 40XD1: Container was opened in read-only mode.
      at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:276)
      at org.apache.derby.impl.store.raw.data.BaseContainer.use(BaseContainer.java:562)
      at org.apache.derby.impl.store.raw.data.BaseContainerHandle.useContainer(BaseContainerHandle.java:834)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.openContainer(BaseDataFileFactory.java:773)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.openContainer(BaseDataFileFactory.java:589)
      at org.apache.derby.impl.store.raw.xact.Xact.openContainer(Xact.java:1316)
      at org.apache.derby.impl.store.access.btree.OpenBTree.init(OpenBTree.java:382)
      at org.apache.derby.impl.store.access.btree.BTreeController.init(BTreeController.java:1225)
      at org.apache.derby.impl.store.access.btree.index.B2IController.init(B2IController.java:140)
      at org.apache.derby.impl.store.access.btree.index.B2I.open(B2I.java:824)
      at org.apache.derby.impl.store.access.RAMTransaction.openConglomerate(RAMTransaction.java:476)
      at org.apache.derby.impl.store.access.RAMTransaction.openCompiledConglomerate(RAMTransaction.java:1293)
      at org.apache.derby.impl.sql.execute.IndexChanger.openIndexCC(IndexChanger.java:507)
      at org.apache.derby.impl.sql.execute.IndexChanger.insertAndCheckDups(IndexChanger.java:438)
      at org.apache.derby.impl.sql.execute.IndexChanger.doInsert(IndexChanger.java:383)
      at org.apache.derby.impl.sql.execute.IndexChanger.insert(IndexChanger.java:590)
      at org.apache.derby.impl.sql.execute.IndexSetChanger.insert(IndexSetChanger.java:268)
      at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(RowChangerImpl.java:453)
      at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(InsertResultSet.java:999)
      at org.apache.derby.impl.sql.execute.InsertResultSet.open(InsertResultSet.java:519)
      at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
      at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
      at org.apache.derbyTesting.functionTests.tests.store.InterruptResilienceTest$WorkerThread.run(InterruptResilienceTest.java:449)

      I was testing a patch for DERBY-4620 (estimate-sizes.diff), but I think the failure isn't related to those changes.

      Attachments

      1. derby.log
        1.92 MB
        Knut Anders Hatlen
      2. error-stacktrace.out
        10 kB
        Knut Anders Hatlen
      3. derby-5312a.diff
        1 kB
        Dag H. Wanvik
      4. derby-5312-simplify-reopen-1.diff
        10 kB
        Dag H. Wanvik
      5. derby-5312-simplify-reopen-1.stat
        0.1 kB
        Dag H. Wanvik

          Activity

          Knut Anders Hatlen added a comment -

          Uploading derby.log and stack trace from the failing test run.

          Dag H. Wanvik added a comment -

          I was able to reproduce this in one out of 100 runs on my laptop. I think it is a side effect of something being thrown in privGetJBMSLockOnDB, cf. the code fragment at ca. line 1933. I'll try to instrument it.

          catch (IOException ioe)
          {
              // probably a read only db, don't do anything more
              readOnly = true;

          Dag H. Wanvik added a comment -

          With this trace in FileContainer#canUpdate:

          protected boolean canUpdate() {
              if (!canUpdate) {
                  System.err.println("FileContainer#canUpdate: " + canUpdate);
              }
              return canUpdate;
          }

          I see 3 failures out of 700 runs:

          ..FileContainer#canUpdate: true
          ..........

          I.e. a race: !canUpdate -> canUpdate, but how? Btw, the value is never printed as "false" on any of the three hits.

          Dag H. Wanvik added a comment - edited

          During reopening of a container there is a small time window inside RAFContainer#run(OPEN_CONTAINER_ACTION):

          canUpdate = false;
          try {
              boolean b1 = !dataFactory.isReadOnly();
              boolean b2 = file.canWrite();
              // if (!dataFactory.isReadOnly() && file.canWrite())
              if (b1 && b2) {
                  canUpdate = true;

          during which the flag canUpdate is set to "false", then (normally) back to "true" again. This code is not synchronized with the call BaseContainer#use -> FileContainer#canUpdate that checks the value of "canUpdate".

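The window described above can be modeled in a small, deterministic sketch (hypothetical code, not Derby's: the observer callback stands in for a concurrent thread calling use() inside the window):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ReopenRace {
    static class Container {
        volatile boolean canUpdate = true; // shared flag checked by use()

        // Buggy reopen: recomputes canUpdate from scratch, so it is
        // transiently false while the recomputation runs.
        void reopenBuggy(Runnable observer) {
            canUpdate = false;
            observer.run();              // a concurrent reader checks the flag here
            boolean writable = true;     // stands in for !isReadOnly() && file.canWrite()
            if (writable) {
                canUpdate = true;
            }
        }

        // Fixed reopen: writability cannot have changed while the channel
        // was closed by the interrupt, so the flag is left alone.
        void reopenFixed(Runnable observer) {
            observer.run();
        }
    }

    public static void main(String[] args) {
        Container c = new Container();
        AtomicBoolean seen = new AtomicBoolean();

        c.reopenBuggy(() -> seen.set(c.canUpdate));
        System.out.println("during buggy reopen: canUpdate = " + seen.get()); // false

        c.reopenFixed(() -> seen.set(c.canUpdate));
        System.out.println("during fixed reopen: canUpdate = " + seen.get()); // true
    }
}
```

The real race is the same interleaving, just without a hook to make it deterministic, which is why it only showed up in a handful of runs out of hundreds.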
          Mike Matrigali added a comment -

          It looks like there is little synchronization on that variable, and access is shared across multiple files, as noted below. While very short, it does seem possible for a change to happen to canUpdate after it is checked, and the JIT probably can't opt it out as it is a shared variable. Just to check for a JIT bug it may help to rewrite it as something like:

          if (!(tmp_canUpdate = canUpdate))
          { print both tmp and canUpdate }

          Maybe a bit more instrumentation might help, maybe the stack of the call. Or, at least for your test, I think if you throw an assert then the new error reporting stuff will print stacks of every thread. It may not work, though, as we are talking about a few instructions while the stack dumps likely take thousands or more. It has the feel of one thread opening a shared container and not properly coordinating with someone that is closing it. Given the test, is it likely someone is incorrectly closing the container?

          Is this test likely flooding the container cache?

          There is at least competition on this resource between the user thread and the store background daemon doing either checkpoint or post-commit work. If the test has multiple threads, then add those as well.

          BaseContainer.java:        if (forUpdate && !canUpdate())
          BaseContainer.java:        protected abstract boolean canUpdate();
          FileContainer.java:        protected boolean canUpdate; // can I be written to?
          FileContainer.java:        canUpdate = false;
          FileContainer.java:        protected boolean canUpdate() {
          FileContainer.java:        return canUpdate;
          InputStreamContainer.java: canUpdate = false;
          RAFContainer.java:         canUpdate = true;
          RAFContainer.java:         canUpdate = false;
          RAFContainer.java:         canUpdate = true;
          RAFContainer.java:         fileData = file.getRandomAccessFile(canUpdate ? "rw" : "r");
          RAFContainer.java:         stub.getRandomAccessFile(canUpdate ? "rw" : "r");

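Mike's suggested rewrite, made concrete as a runnable sketch (RaceProbe and its main are hypothetical scaffolding; only the `if (!(tmp_canUpdate = canUpdate))` idiom is from his comment): the field is read exactly once into a local, so printing both values can reveal whether the field changed between the check and the print, instead of being silently re-read.

```java
public class RaceProbe {
    static boolean canUpdate = true;   // the shared flag under suspicion

    // Read the flag once into a local; if it was false, print both the
    // captured value and the current field value. A mismatch would show the
    // field changing between the check and the print.
    static boolean probe() {
        boolean tmp_canUpdate;
        if (!(tmp_canUpdate = canUpdate)) {
            System.err.println("canUpdate probe: tmp=" + tmp_canUpdate
                               + " now=" + canUpdate);
        }
        return tmp_canUpdate;
    }

    public static void main(String[] args) {
        System.out.println(probe());   // prints "true", no probe message
        canUpdate = false;
        System.out.println(probe());   // prints "false" after the probe message
    }
}
```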
          Dag H. Wanvik added a comment - edited

          Thanks, Mike. I think I understand why this is happening now.
          One thread sees an interrupt and goes on to reopen the container to get the NIO channel up again. During reopening, the "canUpdate" state variable is briefly set to false before being set to true again. Meanwhile, another thread is accessing the container, cf. the call stack of this issue. During that access, BaseContainer#use calls FileContainer#canUpdate and (sometimes) reads the "false" value in the short time window in question.

          Uploading a patch which omits setting "canUpdate" during reopening of the container. It should not have changed during the brief period the channel was closed due to the interrupt, so this seems OK, cf. the Javadoc for the variable "FileContainer#canUpdate": it should only change when we reset the identity of the container.

          Mike Matrigali added a comment -

          The patch looks fine to me; the work need not be done for a reopen.

          But isn't the real problem that there is some uncoordinated activity that should be coordinated? Some thread is actively doing work on the FileContainer because it thinks it is open, but the code is in the middle of reopening it. Is there going to be some other problem with the rest of the work in the reopen and the competing thread, or is that handled somewhere else? I did not follow all the work in this area, so if this is meant to be handled somewhere else, good. Maybe just another comment in your change explaining why it is OK for use() to complete while the file is being reopened would help. I see there is a comment about deadlock on synchronization later, so that is likely an issue.

          Is closeContainer()'s access to fileData with no sync a problem?

          Do you think there is any similar problem with isStub as with canUpdate?

          I don't know if it is possible, but is the only work that needs to be done on a reopen:
          fileData = file.getRandomAccessFile(canUpdate ? "rw" : "r");
          or does the closing exception lose isStub, file, and/or fileName?

          Dag H. Wanvik added a comment -

          Thanks for looking at this, Mike! I agree that closeContainer should be synchronized on the container w.r.t. handling fileData, since fileData will be accessed by other threads during I/O which are oblivious to the reopening taking place. All such use sites are synchronized on "this", or are related to creation and opening. As far as I can see, closeContainer doesn't touch any other container state variables. "isStub" is method local (once defined and used in CREATE_CONTAINER_ACTION, once defined and used in OPEN_CONTAINER_ACTION), so that should be OK.

          As for simplifying to just resetting fileData, I'll look into whether we could do that. "file" is also method local, and "fileName" is only set in OPEN_CONTAINER_ACTION and only ever used as a string when throwing exceptions, so it should be possible to simplify reopening.

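The kind of synchronization being agreed on here can be sketched as follows (an assumed shape, not Derby's code: ContainerHandle is hypothetical and java.io.RandomAccessFile stands in for the store's file abstraction). The point is that every touch of the shared fileData handle happens under the container monitor:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ContainerHandle {
    private final File file;
    private final boolean canUpdate;
    private RandomAccessFile fileData;   // shared across threads; guarded by "this"

    public ContainerHandle(File file, boolean canUpdate) throws IOException {
        this.file = file;
        this.canUpdate = canUpdate;
        this.fileData = new RandomAccessFile(file, canUpdate ? "rw" : "r");
    }

    // Close under the container monitor so concurrent I/O or a concurrent
    // reopen never observes a half-closed handle.
    public synchronized void closeContainer() throws IOException {
        if (fileData != null) {
            fileData.close();
            fileData = null;
        }
    }

    public synchronized boolean isOpen() {
        return fileData != null;
    }
}
```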
          Dag H. Wanvik added a comment -

          Uploading a patch (derby-5312-simplify-reopen-1) which simplifies the reopening to essentially just a new call to

          file.getRandomAccessFile(canUpdate ? "rw" : "r")

          as discussed. To get "file", I let RAFContainer4 keep the current identity so the reopen can find the file by handing back the identity when we need to make the call:

          StorageFile file = privGetFileName(actionIdentity, false, true, true);

          but removed that state variable from RAFContainer. I ran an earlier version of this overnight (some 800 iterations of the test) without seeing this error again. Running regressions.

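The simplified reopen might look roughly like this (a sketch under the assumptions discussed: only the file handle needs re-acquiring, and canUpdate is left untouched; SimplifiedReopen and java.io.RandomAccessFile are stand-ins for Derby's RAFContainer and its storage abstraction, not the actual patch):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SimplifiedReopen {
    private final File file;          // in the patch, found again via the kept identity
    private final boolean canUpdate;  // untouched by reopen: writability can't have changed
    private RandomAccessFile fileData;

    public SimplifiedReopen(File file, boolean canUpdate) throws IOException {
        this.file = file;
        this.canUpdate = canUpdate;
        this.fileData = new RandomAccessFile(file, canUpdate ? "rw" : "r");
    }

    // After an interrupt closed the channel: just get a fresh handle with the
    // same mode; no container state is recomputed.
    public synchronized void reopen() throws IOException {
        fileData = new RandomAccessFile(file, canUpdate ? "rw" : "r");
    }

    public synchronized long length() throws IOException {
        return fileData.length();
    }
}
```

Keeping canUpdate and the identity immutable across a reopen is what removes the race window: readers can no longer observe a transient "false".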
          Mike Matrigali added a comment -

          The new patch looks good to me, now it is very clear what reopen does and seems simpler.

          Dag H. Wanvik added a comment -

          Committed derby-5312-simplify-reopen-1 as svn 1148344, resolving, but not closing: will backport to 10.8 first.

          Dag H. Wanvik added a comment -

          Back-ported to 10.8 branch as svn 1151122. Knut, feel free to close if you think the issue is adequately addressed.

          Knut Anders Hatlen added a comment -

          I haven't seen the failure again after the fix went in. Closing the issue. Thanks for fixing it, Dag.


            People

            • Assignee:
              Dag H. Wanvik
            • Reporter:
              Knut Anders Hatlen
            • Votes:
              0
            • Watchers:
              0
