Derby
DERBY-2254

Assert during log file switch: log file position exceeded max log file size

    Details

    • Urgency:
      Normal
    • Issue & fix info:
      Patch Available
    • Bug behavior facts:
      Data corruption

      Description

      When running simple TPC-B-like transactions against an embedded Derby based on a SANE build of trunk, the following assertion occurs for the background thread and all user threads:

      org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED log file position exceeded max log file size

      This seems to occur during a switch to a new log file.

      derby.log contains the following call stack for the background thread:

      Exception trace:
      org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED log file position exceeded max log file size
      at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:120)
      at org.apache.derby.impl.store.raw.log.LogCounter.makeLogInstantAsLong(LogCounter.java:120)
      at org.apache.derby.impl.store.raw.log.LogToFile.switchLogFile(LogToFile.java:1900)
      at org.apache.derby.impl.store.raw.log.LogToFile.appendLogRecord(LogToFile.java:3530)
      at org.apache.derby.impl.store.raw.log.FileLogger.logAndDo(FileLogger.java:345)
      at org.apache.derby.impl.store.raw.xact.Xact.logAndDo(Xact.java:1185)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1540)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1357)
      at org.apache.derby.impl.store.raw.RawStore.checkpoint(RawStore.java:439)
      at org.apache.derby.impl.store.raw.log.LogToFile.performWork(LogToFile.java:3416)
      at org.apache.derby.impl.services.daemon.BasicDaemon.serviceClient(BasicDaemon.java:331)
      at org.apache.derby.impl.services.daemon.BasicDaemon.work(BasicDaemon.java:668)
      at org.apache.derby.impl.services.daemon.BasicDaemon.run(BasicDaemon.java:394)
      at java.lang.Thread.run(Thread.java:619)
      2007-01-17 23:09:48.638 GMT Thread[derby.rawStoreDaemon,5,derby.daemons] Cleanup action starting
      org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED log file position exceeded max log file size
      at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:120)
      at org.apache.derby.impl.store.raw.log.LogCounter.makeLogInstantAsLong(LogCounter.java:120)
      at org.apache.derby.impl.store.raw.log.LogToFile.switchLogFile(LogToFile.java:1900)
      at org.apache.derby.impl.store.raw.log.LogToFile.appendLogRecord(LogToFile.java:3530)
      at org.apache.derby.impl.store.raw.log.FileLogger.logAndDo(FileLogger.java:345)
      at org.apache.derby.impl.store.raw.xact.Xact.logAndDo(Xact.java:1185)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1540)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1357)
      at org.apache.derby.impl.store.raw.RawStore.checkpoint(RawStore.java:439)
      at org.apache.derby.impl.store.raw.log.LogToFile.performWork(LogToFile.java:3416)
      at org.apache.derby.impl.services.daemon.BasicDaemon.serviceClient(BasicDaemon.java:331)
      at org.apache.derby.impl.services.daemon.BasicDaemon.work(BasicDaemon.java:668)
      at org.apache.derby.impl.services.daemon.BasicDaemon.run(BasicDaemon.java:394)
      at java.lang.Thread.run(Thread.java:619)
      Cleanup action completed

      For my user threads the call stack is similar:

      Database Class Loader started - derby.database.classpath=''
      2007-01-17 23:09:36.401 GMT Thread[Thread-51,5,main] (XID = 12632406), (SESSIONID = 51), (DATABASE = /export/home/tmp/derby-db), (DRDAID = null), Cleanup action starting
      2007-01-17 23:09:36.401 GMT Thread[Thread-51,5,main] (XID = 12632406), (SESSIONID = 51), (DATABASE = /export/home/tmp/derby-db), (DRDAID = null), Failed Statement is: UPDATE accounts SET abal = abal + ? WHERE aid = ? AND bid = ?
      org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED log file position exceeded max log file size
      at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:120)
      at org.apache.derby.impl.store.raw.log.LogCounter.makeLogInstantAsLong(LogCounter.java:120)
      at org.apache.derby.impl.store.raw.log.LogToFile.switchLogFile(LogToFile.java:1900)
      at org.apache.derby.impl.store.raw.log.LogToFile.appendLogRecord(LogToFile.java:3530)
      at org.apache.derby.impl.store.raw.log.FileLogger.logAndDo(FileLogger.java:345)
      at org.apache.derby.impl.store.raw.xact.Xact.logAndDo(Xact.java:1185)
      at org.apache.derby.impl.store.raw.data.LoggableActions.doAction(LoggableActions.java:221)
      at org.apache.derby.impl.store.raw.data.LoggableActions.actionUpdate(LoggableActions.java:85)
      at org.apache.derby.impl.store.raw.data.StoredPage.doUpdateAtSlot(StoredPage.java:8521)
      at org.apache.derby.impl.store.raw.data.BasePage.updateAtSlot(BasePage.java:1108)
      at org.apache.derby.impl.store.access.conglomerate.GenericConglomerateController.replace(GenericConglomerateController.java:479)
      at org.apache.derby.impl.sql.execute.RowChangerImpl.updateRow(RowChangerImpl.java:523)
      at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:566)
      at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:260)
      at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:358)
      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1182)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1652)
      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:299)
      at com.sun.derby.perf.clients.tpcb.DBConnection.performTransaction(DBConnection.java:595)
      at com.sun.derby.perf.clients.tpcb.Client.run(Client.java:218)

      After this it seems like no user threads are able to connect to the database (not surprising).

      This happened using a SANE build, and just from looking at the assertion in the code I am unsure whether this situation would also have been fatal with an INSANE build.

      Attachments

      1. derby-2254-01-ac-accountForMoreOverhead.diff
        8 kB
        Rick Hillegas
      2. derby-2254-01-ab-accountForMoreOverhead.diff
        8 kB
        Rick Hillegas
      3. d2254.zip
        2 kB
        Rick Hillegas

        Issue Links

          DERBY-2286: has a repro program which can be used to trigger this bug

          Activity

          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Mike Matrigali added a comment -

           Backported to 10.5, 10.6 and 10.7. Restoring original owner; I don't plan on backporting any farther back for now.

          Mike Matrigali added a comment -

          Assign temporarily for backport.

          Kathey Marsden added a comment -

           Reopening for 10.5 backport consideration. If you are working on the backport for this issue, temporarily assign yourself and add a comment that you are working on it. When finished, re-resolve with the new fix versions, or add the label backport_reject_10_x as appropriate.

          Rick Hillegas added a comment -

          I think that it should be safe to backport this fix to earlier branches.

          Rick Hillegas added a comment -

           Ported revision 1102417 from trunk to the 10.8 branch at subversion revision 1102421.

          Rick Hillegas added a comment -

          Thanks Knut and Dag. I have committed derby-2254-01-ac-accountForMoreOverhead.diff to trunk at subversion revision 1102417.

          Dag H. Wanvik added a comment -

          I thought of the inlining possibility after I posted my comment, too. It should be easy for the compiler to inline this one, and in insane mode it would not even be a trade-off between speed and code volume (single assignment being cheaper than a procedure call). So, if you are so inclined, feel free to commit as is.

          Knut Anders Hatlen added a comment -

          I wouldn't worry too much about the performance impact of the patch. I'd trust the runtime compiler to do the right thing and inline the setter method. And even if it doesn't, it looks like the new code only adds two calls to the setter method per log record, and that cost should be negligible compared to the cost of writing the log record to disk. I'm not sure we have any good tests for this. We have the bank_tx load and the sr_update load in org.apache.derbyTesting.perf.clients.Runner that will produce a lot of log records, but you'll probably have to run with derby.system.durability=test or with the in-memory backend so that the changes aren't overshadowed by the tests waiting for the log being flushed to disk on commit.
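
           If it helps, a rough ad-hoc harness along the following lines (not the Runner loads mentioned above; the class and table names are made up) can give a ballpark number for log-write throughput, with derby.system.durability=test set so the measurement is not dominated by waiting for log flushes on commit:

           import java.sql.Connection;
           import java.sql.DriverManager;
           import java.sql.PreparedStatement;
           import java.sql.Statement;

           public class LogOverheadTiming {
               public static void main(String[] args) throws Exception {
                   // Relax durability so commits do not wait for the log to reach disk.
                   System.setProperty("derby.system.durability", "test");

                   Connection c = DriverManager.getConnection("jdbc:derby:perfdb;create=true");
                   c.setAutoCommit(false);
                   Statement s = c.createStatement();
                   s.execute("CREATE TABLE t (id INT PRIMARY KEY, val INT)");
                   PreparedStatement ins = c.prepareStatement("INSERT INTO t VALUES (?, 0)");
                   for (int i = 0; i < 10000; i++) { ins.setInt(1, i); ins.executeUpdate(); }
                   c.commit();

                   // Time a burst of updates; each one appends at least one log record.
                   PreparedStatement upd = c.prepareStatement("UPDATE t SET val = val + 1 WHERE id = ?");
                   long start = System.nanoTime();
                   for (int round = 0; round < 100; round++) {
                       for (int i = 0; i < 10000; i++) { upd.setInt(1, i); upd.executeUpdate(); }
                       c.commit();
                   }
                   long elapsedMs = (System.nanoTime() - start) / 1000000L;
                   System.out.println("1000000 updates in " + elapsedMs + " ms");
                   c.close();
               }
           }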

          Rick Hillegas added a comment -

          Thanks for the quick review, Dag. Attaching derby-2254-01-ac-accountForMoreOverhead.diff, which makes the change you recommended. The full regression tests ran cleanly for me.

          Any suggestions about a test I should run to measure whether the changes affect performance?

          Thanks,
          -Rick

          Dag H. Wanvik added a comment -

           Great catch, Rick! Looks safe to me, in that the added code could only make the log switch happen earlier. I believe it's probably fine to keep the procedure for setting the end position, but I'd suggest running some performance tests so we know how big the impact is. I found this double assignment, which is probably a typo (harmless):

          > setEndPosition( endPosition = newLog.getFilePointer() );

          Rick Hillegas added a comment -

          Attaching derby-2254-01-ab-accountForMoreOverhead.diff, a candidate fix for this bug. After applying this patch, I can't reproduce the bug using the attached repro. I have run the repro for more than 1500 iterations.

          This is my understanding of what is going on here:

           A) The original assertion is supposed to check that you are not creating an illegal log instant. A log instant consists of a log file number and an offset into that file. A log instant is illegal if the file number or the offset is too big. In this particular case, the assertion fails because the offset is too big.

          B) The offset is the actual end of the current log file. That is, the log file has already spilled past its legal end.

          C) The assertion is triggered after the log writer notices that the log file needs to be switched. The log writer ought to detect this condition and switch the log file BEFORE writing a log record which spills over the end. Unfortunately, the log writer only notices this condition AFTER a log record has been written which straddles the legal end of file.

           D) We check whether we should switch by asking whether the remaining bytes we need to write will spill past the legal end of the file. To compute the number of bytes we expect to write, we add the length of the log wrapper overhead to the length of the current log record. However, it turns out that we may need to write more bytes than that. The calculation does not include the extra four bytes needed to write a terminating 0 to the log. The calculation also does not include the bytes needed to write a checksum record.

          E) If we are already very close to the legal end of the log file, then we could spill past the end after writing the wrapper, the log record, the terminating 0, and the checksum record. I suppose that we get here if the checkpointing thread can't get in often enough to switch log files before we encroach on the end of the current file.

           F) The fix is to require that the log accommodate that extra overhead (the terminating 0 and the checksum record). Now we force a log switch if there is not enough room to write the current log record, its wrapper, the terminating 0, and the checksum record (see the sketch below).
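
           To make (D) through (F) concrete, here is a minimal, self-contained sketch of the switch decision with the extra overhead included. The names and the 16-byte framing figure are assumptions for illustration; the real logic lives in LogToFile.appendLogRecord() and uses the actual constants there.

           // Illustrative sketch only; constant values and member names are assumed.
           public class LogSwitchCheckSketch {

               // Per-record framing: length prefix, log instant, trailing length (assumed 16 bytes total).
               private static final int LOG_RECORD_OVERHEAD = 16;
               // The terminating 0 written after the last record in a log file (4 bytes, per (D) above).
               private static final int LOG_FILE_TERMINATOR = 4;

               /**
                * Decide whether the log must be switched before appending a record of
                * recordLength bytes. The fix adds LOG_FILE_TERMINATOR and the checksum
                * record length to the overhead that must still fit in the current file.
                */
               static boolean mustSwitchFirst(long endPosition, int recordLength,
                                              int checksumRecordLength, long logFileSizeLimit) {
                   long bytesStillNeeded = LOG_RECORD_OVERHEAD + recordLength
                           + LOG_FILE_TERMINATOR + checksumRecordLength;
                   return endPosition + bytesStillNeeded >= logFileSizeLimit;
               }
           }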

          I have made 3 changes:

          1) The actual change to LogToFile to increase the size of the overhead space we require to avoid a log switch.

           2) All of the places in LogToFile which update the remembered endPosition have been reworked to call a brief method which sanity checks whether we have already spilled past the end of the file (see the sketch after this list). I put in this instrumentation in order to find where the extra bytes were being written. I think it is a useful sanity check, but I can remove it if people think that it will slow down log processing significantly.

          3) I added a little more information to the original assertion and its friends in LogCounter.
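
           For illustration, the sanity-checking setter described in (2) might look roughly like this. This is a sketch of the idea, not the actual patch; the field and method names are assumed, and in an INSANE build the guarded check disappears, leaving only the assignment:

           // Sketch of a sanity-checked setter for the remembered end-of-log position
           // (would live in LogToFile; names assumed, not taken from the patch).
           private void setEndPosition(long newPosition)
           {
               if (SanityManager.DEBUG)
               {
                   SanityManager.ASSERT(newPosition < LogCounter.MAX_LOGFILE_SIZE,
                       "new end position " + newPosition +
                       " exceeds max log file size " + LogCounter.MAX_LOGFILE_SIZE);
               }
               endPosition = newPosition;
           }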

          Touches the following files:

          ----------

          M java/engine/org/apache/derby/impl/store/raw/log/LogAccessFile.java

          Exposed the length of the checksum record so that LogToFile can accurately calculate its overhead.

          ----------

          M java/engine/org/apache/derby/impl/store/raw/log/LogCounter.java

           Item (3) as described above.

          ----------

          M java/engine/org/apache/derby/impl/store/raw/log/LogToFile.java

           Items (1) and (2) as described above.

          Rick Hillegas added a comment -

          Attaching a repro which Knut created. This consists of the repro attached to DERBY-2286, which Knut discovered pops this bug if run against an in-memory database often enough. This repro creates an in-memory database and runs 4 update-intensive threads against it. If the bug doesn't occur within 30 seconds, the experiment is repeated. You may need to edit the run script to customize the location of your Java 6 jdk and your derby.jar.

          To run the repro, just unzip it and type

          ./run

          Running this on a Solaris machine with a lot of processors, I am able to reproduce this bug sometimes after a couple iterations, usually in under 200 iterations, and always in under 300 iterations.
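
           For readers without the attachment handy, a rough sketch of the same shape (an in-memory database plus a handful of update-intensive threads) is shown below. This is not the attached d2254.zip; the class, table, and column names are invented for illustration.

           import java.sql.Connection;
           import java.sql.DriverManager;
           import java.sql.PreparedStatement;
           import java.sql.Statement;
           import java.util.Random;

           public class InMemoryUpdateStress {
               private static final String URL = "jdbc:derby:memory:d2254;create=true";

               public static void main(String[] args) throws Exception {
                   // Create and populate a small table in the in-memory database.
                   Connection setup = DriverManager.getConnection(URL);
                   Statement s = setup.createStatement();
                   s.execute("CREATE TABLE accounts (aid INT PRIMARY KEY, abal INT)");
                   PreparedStatement ins = setup.prepareStatement("INSERT INTO accounts VALUES (?, 0)");
                   for (int i = 0; i < 1000; i++) { ins.setInt(1, i); ins.executeUpdate(); }
                   setup.close();

                   // Four update-intensive threads, as in the description of the repro.
                   Thread[] workers = new Thread[4];
                   for (int t = 0; t < workers.length; t++) {
                       workers[t] = new Thread(new Updater(), "updater-" + t);
                       workers[t].start();
                   }
                   for (int t = 0; t < workers.length; t++) { workers[t].join(); }
               }

               private static class Updater implements Runnable {
                   public void run() {
                       try {
                           Connection c = DriverManager.getConnection(URL);
                           PreparedStatement upd = c.prepareStatement(
                                   "UPDATE accounts SET abal = abal + 1 WHERE aid = ?");
                           Random r = new Random();
                           long stop = System.currentTimeMillis() + 30000L;   // one 30-second iteration
                           while (System.currentTimeMillis() < stop) {
                               upd.setInt(1, r.nextInt(1000));
                               upd.executeUpdate();   // autocommit on: every update is logged and committed
                           }
                           c.close();
                       } catch (Exception e) {
                           // On a SANE build the AssertFailure would surface here.
                           e.printStackTrace();
                       }
                   }
               }
           }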

          Rick Hillegas added a comment -

          Linking to DERBY-2286, which has a repro program which can be used to trigger this bug.

          Rick Hillegas added a comment -

          Re-categorizing this as a data corruption since the workaround of deleting the log file is scary.

          Trejkaz added a comment -

          We have one user experiencing this issue in production. We don't do any crazy stuff like compressing tables or checkpointing, but we do add lots of rows quickly and I don't know, maybe once the computer gets sufficiently fast, this issue surfaces.

           The machine our user is using is a monster with 32 GB of RAM, although we allocate less than 2 GB to each individual process, and each process has its own database.

          What is the underlying cause here? I/O not being fast enough to update the database so it has to write to the log instead? Is there any way to avoid it at all?

          Knut Anders Hatlen added a comment -

          Triaged for 10.5.2.

          quartz added a comment -

           Because the stored procedure call SYSCS_UTIL.SYSCS_COMPRESS_TABLE(.....)
           cannot run concurrently with inserts (see DERBY-1068),
           our log file grows like crazy. I suppose the inserts either go really slowly, or grind to a halt and back up in the log file.
           Eventually, it crashes. Dead for good. Irrecoverable.
           We can't wait for a fix...

          Mike Matrigali added a comment -

           Deleting the log file is the easiest way to guarantee corruption in the Derby database. It is impossible to know what kind of corruption might have happened, but all of the following are possible, because Derby relies on the info in the log file to maintain consistent data. Because of in-memory database caching, the database files on disk may be severely out of date and inconsistent without proper recovery processing using the log files. All of the following corruptions are possible when one deletes one or more log files:

           - transaction-inconsistent data between tables: many of these inconsistencies can't be detected by Derby
           - inconsistent data between base tables and indexes: running the consistency checker does a good job of checking for this
           - inconsistent data in long rows: a long row may stretch across many pages, and the links depend on transactional consistency; if this is broken, a number of runtime errors may occur
           - inconsistent data in long columns: similar to long rows, a single column may be linked across many pages, and this consistency is maintained by the transaction log

           If it is ever necessary to remove the log because of some Derby bug, the only safe thing to do is to go back to your most recent backup. If you don't have one, what I usually tell users is to assume the db they are accessing is corrupt but accessible. The only safe thing at that point is to data-mine the existing db to copy whatever data looks good out to a file and then load it into a newly created db. Otherwise one will never know whether future errors encountered in the db are because of this failed recovery.

          quartz added a comment -

           A few details:
           - I have this state generated from a Linux machine (not OS specific).
           - I use the derbynet NetworkServerControl, not the embedded driver.
           - I also saw the error while starting up on Windows with the embedded driver over the copied DB.
           - I am on 10.2.2.0.

           I deleted the log file and the DB works. Of course I expect data loss, but
           I wanted to note that only the log file seems to be affected.

          Olav Sandstaa added a comment -

           Just for the record: I tried to restart the database to see if it was able to recover. Not surprisingly, about three hours into the recovery it was aborted with the same assert and the following call stack:

          Exception trace:
          org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED log file position exceeded max log file size
          at org.apache.derby.shared.common.sanity.SanityManager.ASSERT(SanityManager.java:120)
          at org.apache.derby.impl.store.raw.log.LogCounter.makeLogInstantAsLong(LogCounter.java:120)
          at org.apache.derby.impl.store.raw.log.Scan.getNextRecordForward(Scan.java:973)
           at org.apache.derby.impl.store.raw.log.Scan.getNextRecord(Scan.java:206)
           at org.apache.derby.impl.store.raw.log.FileLogger.redo(FileLogger.java:1176)
          at org.apache.derby.impl.store.raw.log.LogToFile.recover(LogToFile.java:811)
          at org.apache.derby.impl.store.raw.RawStore.boot(RawStore.java:308)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1994)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:546)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:419)
          at org.apache.derby.impl.store.access.RAMAccessManager.boot(RAMAccessManager.java:988)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1994)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(BaseMonitor.java:546)
          at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Monitor.java:419)
          at org.apache.derby.impl.db.BasicDatabase.bootStore(BasicDatabase.java:746)
          at org.apache.derby.impl.db.BasicDatabase.boot(BasicDatabase.java:182)
          at org.apache.derby.impl.services.monitor.BaseMonitor.boot(BaseMonitor.java:1994)
          at org.apache.derby.impl.services.monitor.TopService.bootModule(TopService.java:291)
          at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(BaseMonitor.java:1829)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(BaseMonitor.java:1695)
          at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(BaseMonitor.java:1575)
          at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(BaseMonitor.java:994)
          at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Monitor.java:542)
          at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(EmbedConnection.java:1633)
          at org.apache.derby.impl.jdbc.EmbedConnection.<init>(EmbedConnection.java:223)
          at org.apache.derby.impl.jdbc.EmbedConnection30.<init>(EmbedConnection30.java:73)
          at org.apache.derby.impl.jdbc.EmbedConnection40.<init>(EmbedConnection40.java:54)
          at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Driver40.java:64)
          at org.apache.derby.jdbc.InternalDriver.connect(InternalDriver.java:210)
          at org.apache.derby.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:117)
          at java.sql.DriverManager.getConnection(DriverManager.java:582)
          at java.sql.DriverManager.getConnection(DriverManager.java:185)
          at com.sun.derby.perf.clients.tpcb.DBConnection.loadDriver(DBConnection.java:140)
          at com.sun.derby.perf.clients.tpcb.DBConnection.<init>(DBConnection.java:49)
          at com.sun.derby.perf.clients.tpcb.TPCBClient.run(TPCBClient.java:132)
          at com.sun.derby.perf.clients.tpcb.TPCBClient.main(TPCBClient.java:381)

          Olav Sandstaa added a comment -

           Mike, thank you for commenting on this and for explaining what you
           think are the issues behind this problem. I agree that the best
           solution seems to be to fix this so that the background thread can do
           log file switching without waiting for the checkpoint to complete.

           Some questions somebody may know the answer to:

           - What is the reason for most of the log files having the default
           size of 1 MB, while a few (4) log files grew to 268 MB before
           switching?

          -The crash occurs after having produced about 1 GB of log. Is there
          a hard limit on 1 GB before the checkpoint must be completed?

           This crash happened during some experimental testing where I wanted to
           study data rates to and from the disks during high load. I can do this
           with a much smaller database buffer, so being able to run with an 18 GB
           database buffer is not an urgent issue for me right now.

          Mike Matrigali added a comment -

           It would also be good from a performance standpoint to minimize the amount of data written to a log file over its preallocated size, as on some systems those writes cost twice as much as writes within the preallocated size.

          Mike Matrigali added a comment -

          Also wanted to note that this max limit is really a hard limit as the log address can only properly address files in this range. Once you go over that size we can't properly address records in that file.

          Mike Matrigali added a comment -

           This is very interesting; you are definitely delving into areas never before tested. I don't think we have ever even had a machine with more than 2 GB of memory. 18 GB is definitely going to stress the system.

           Here are the issues I think you are running into; it may be better to get them into separate issues for tracking:

           1) My best guess is that the act of switching the log file at 1 gig has become stuck behind finishing the checkpoint. I don't believe there is a design issue that requires us to wait for the checkpoint in order to switch. My first guess is that our single background daemon is stuck doing a checkpoint and can't service the switch request - but I have not checked the code. For these multi-processor systems a single daemon not doing anything async is probably not going to cut it.

           2) The math for the largest file number looks like it has a bug somewhere. I don't think we ever really expected a file to grow to this size; mostly it was there to make sure someone didn't set the log file size to greater than it. A test should be written and the code fixed to properly handle forcing a switch, or waiting for a switch, if you get to this size. My guess is this has not been tested before. I am not sure how easy such a test will be, other than a stress test like yours - though the real fix is probably to fix 1 and avoid 2 altogether.

           3) The ASSERT messages weren't very helpful; I'll update them so at least we get the real info out when you run into this.

          Olav Sandstaa added a comment -

           The crash happened when running simple transactions (3 updates, 1
           insert, 1 select) against a database with 10 GB of user data (total data
           volume 17 GB). Derby was configured with an 18 GB database buffer. The
           crash occurred about 2 hours after re-starting an existing
           database. The load consisted of 64 clients running transactions
           back-to-back. Since the database was "just" started, most of the
           transactions had to read data from disk, so there were about 60
           concurrent read requests for getting data into the database buffer. At
           the same time the single background thread was having a hard time
           keeping up with writing out dirty data to disk. At this point there was
           plenty of free room in the page cache, so no data blocks were "evicted"
           out to disk by the user threads.

           Since the background thread was not able to flush data pages fast
           enough, it was also unable to reclaim log files. So the amount of log
           kept growing.

          One thing I do not understand is that most of the log files have their
          normal size while some are very close to the max log file size. Here
          is a list of the files in the log directory after the crash:

          rw-r---- 1 olav 48 Jan 17 22:09 log.ctrl
          rw-r---- 1 olav 48 Jan 17 22:09 logmirror.ctrl
          rw-r---- 1 olav 1060411 Jan 17 22:09 log1181.dat
          rw-r---- 1 olav 1067038 Jan 17 22:09 log1182.dat
          rw-r---- 1 olav 1049488 Jan 17 22:09 log1183.dat
          rw-r---- 1 olav 1050176 Jan 17 22:09 log1184.dat
          rw-r---- 1 olav 1050758 Jan 17 22:10 log1185.dat
          rw-r---- 1 olav 1050583 Jan 17 22:10 log1186.dat
          rw-r---- 1 olav 1050178 Jan 17 22:10 log1187.dat
          rw-r---- 1 olav 1054465 Jan 17 22:10 log1188.dat
          rw-r---- 1 olav 1051730 Jan 17 22:11 log1189.dat
          rw-r---- 1 olav 1052601 Jan 17 22:11 log1190.dat
          rw-r---- 1 olav 268435436 Jan 17 22:55 log1191.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1192.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1193.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1194.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1195.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1196.dat
          rw-r---- 1 olav 1048576 Jan 17 22:55 log1197.dat
          rw-r---- 1 olav 268435434 Jan 17 23:23 log1198.dat
          rw-r---- 1 olav 1048576 Jan 17 23:23 log1199.dat
          rw-r---- 1 olav 1048576 Jan 17 23:23 log1200.dat
          rw-r---- 1 olav 1048576 Jan 17 23:23 log1201.dat
          rw-r---- 1 olav 1048576 Jan 17 23:23 log1202.dat
          rw-r---- 1 olav 268435429 Jan 17 23:46 log1203.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1204.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1205.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1206.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1207.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1208.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1209.dat
          rw-r---- 1 olav 1048576 Jan 17 23:46 log1210.dat
          rw-r---- 1 olav 268435456 Jan 18 00:09 log1211.dat
          rw-r---- 1 olav 0 Jan 18 00:09 log1212.dat

          In org.apache.derby.impl.store.raw.log.LogCounter the following constant is defined

          // reserve top 4 bits in log file size for future use
          public static final long MAX_LOGFILE_SIZE = (long)0x0FFFFFFFL; // 268435455

          and the assert is defined as:

          SanityManager.ASSERT(filepos < MAX_LOGFILE_SIZE,
          "log file position exceeded max log file size");

           Comparing the constant MAX_LOGFILE_SIZE (268435455) with the length of
           the second-to-last file above (268435456), it actually seems like this log
           file has become a byte too long.
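
           For context, a log instant packs the log file number and the position within that file into a single long, which is why the position has this hard ceiling. A rough sketch of the idea follows; only MAX_LOGFILE_SIZE is taken from the quoted code, the shift amount and bit layout are assumptions, and the authoritative code is LogCounter.makeLogInstantAsLong():

           public class LogInstantSketch {
               public static final long MAX_LOGFILE_SIZE = 0x0FFFFFFFL;   // 268435455, from LogCounter
               private static final int FILE_NUMBER_SHIFT = 32;           // assumed: file number kept in the high bits

               /** Combine a log file number and an offset within that file into one instant. */
               public static long makeLogInstant(long fileNumber, long filePosition) {
                   if (filePosition >= MAX_LOGFILE_SIZE) {
                       // This is the condition behind "log file position exceeded max log
                       // file size": log1211.dat above is 268435456 bytes, past the limit.
                       throw new IllegalArgumentException("log file position " + filePosition
                               + " exceeded max log file size " + MAX_LOGFILE_SIZE);
                   }
                   return (fileNumber << FILE_NUMBER_SHIFT) | filePosition;
               }
           }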


            People

            • Assignee:
              Rick Hillegas
              Reporter:
              Olav Sandstaa
            • Votes:
              2
              Watchers:
              5
