Derby
  1. Derby
  2. DERBY-4252

intermittent corruption./java.io.EOFException in RandomAccessFile.readInt() on boot after compress with background checkpoint

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.6.1.0
    • Fix Version/s: None
    • Component/s: Store
    • Environment:
    • Urgency:
      Normal
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Data corruption

      Description

      The attached reproduction reproBackgroundCheckpoint.zip occasionally results in an EOF exception on boot with IBM 1.6 (about 1 out of 200 runs) even after the fix for DERBY-4239. The program inserts into /deletes some data from a table and runs compress, and as a daemon thread that loops checkpoints. I have not seen it with the Sun JVM.

      Before the fix for DERBY-4239, this issue was much more frequent. Sometimes you would get the DERBY-4239 trace and sometimes this one. After the fix, this issue remains. Mike thought it looked like a different bug and asked that we file a separate issue.

      This is the trace:
      java.io.EOFException

      at java.io.RandomAccessFile.readInt(RandomAccessFile.java:739)

      at java.io.RandomAccessFile.readLong(RandomAccessFile.java:772)

      at org.apache.derby.impl.store.raw.log.Scan.getNextRecordForward(Unknown Source)

      at org.apache.derby.impl.store.raw.log.Scan.getNextRecord(Unknown Source)

      at org.apache.derby.impl.store.raw.log.FileLogger.redo(Unknown Source)

      at org.apache.derby.impl.store.raw.log.LogToFile.recover(Unknown Source)

      at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)

      at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)

      at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)

      at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)

      at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)

      at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)

      at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)

      at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)

      at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)

      at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)

      at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)

      at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source)

      at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)

      at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)

      at java.sql.DriverManager.getConnection(DriverManager.java:316)

      at java.sql.DriverManager.getConnection(DriverManager.java:273)

      at CheckTables.main(CheckTables.java:8)

      To reproduce, compile the java programs and run reprobckchkpt.ksh. You may want to increase the number of iterations and let it run overnight to see the failure or back out the fix for DERBY-4239 to see it happen quicker.

      1. reproBackgroundCheckpoint.zip
        15 kB
        Kathey Marsden
      2. corrupt_database_with_logs.zip
        2.39 MB
        Kathey Marsden

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Closed Closed
          713d 4h 29m 1 Kathey Marsden 16/May/11 18:32
          Gavin made changes -
          Workflow jira [ 12464916 ] Default workflow, editable Closed status [ 12801050 ]
          Kathey Marsden made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Resolution Cannot Reproduce [ 5 ]
          Hide
          Kathey Marsden added a comment -

          I am not able to reproduce this on trunk with the latest IBM JVM.

          Show
          Kathey Marsden added a comment - I am not able to reproduce this on trunk with the latest IBM JVM.
          Hide
          Rick Hillegas added a comment -

          Hi Kathey,

          I have no other ideas. As you can see, Knut and I have not been able to reproduce this problem on other platforms either. Thanks.

          Show
          Rick Hillegas added a comment - Hi Kathey, I have no other ideas. As you can see, Knut and I have not been able to reproduce this problem on other platforms either. Thanks.
          Hide
          Kathey Marsden added a comment -

          The SR9 run on trunk ran 2000 times with out incident. However this is a different machine (still XP), different, java version and different derby version.

          I went back to SR4 and got a hang after 542 iterations where the ReproCorruptionBackgroundCheckpoint program completed its work, printing Ending Test: test1 but did not exit as though there was some running thread blocking exit. I tried unsuccessfully to get a thread dump of the hanging process. The database itself was fine.

          I think I would be comfortable closing this cannot reproduce unless someone else has some ideas for something else I should try.

          Show
          Kathey Marsden added a comment - The SR9 run on trunk ran 2000 times with out incident. However this is a different machine (still XP), different, java version and different derby version. I went back to SR4 and got a hang after 542 iterations where the ReproCorruptionBackgroundCheckpoint program completed its work, printing Ending Test: test1 but did not exit as though there was some running thread blocking exit. I tried unsuccessfully to get a thread dump of the hanging process. The database itself was fine. I think I would be comfortable closing this cannot reproduce unless someone else has some ideas for something else I should try.
          Hide
          Kathey Marsden added a comment -

          I kicked off a run with trunk and IBM 1.6 SR9. I will let it run over the weekend and see if it pops the issue.

          Booting Derby version The Apache Software Foundation - Apache Derby - 10.9.0.0 a
          lpha - (1097280): instance a816c00e-012f-eac3-8c01-0000001c6ff0
          on database directory C:\cygwin\home\kmarsden\repro\derby-4252\wombat with clas
          s loader sun.misc.Launcher$AppClassLoader@4b304b30
          Loaded from file:/C:/cygwin/svn4/trunka/jars/sane/derby.jar
          java.vendor=IBM Corporation
          java.runtime.version=jvmwi3260sr9-20110203_74623
          java.fullversion=JRE 1.6.0 IBM J9 2.4 Windows XP x86-32 jvmwi3260sr9-20110203_74
          623 (JIT enabled, AOT enabled)
          J9VM - 20110203_074623
          JIT - r9_20101028_17488ifx3

          Regarding what you see in the log files with the clustered compresses at the end, it would be interesting to see if that was the same for the passing runs. The test basically.
          1) Launches a thread that loops doing checkpoints with no pause between.
          2) Loads a table with 5000 rows.
          3) Deletes all the rows with a single sql statement delete from
          4) calls SYSCS_UTIL.SYSCS_INPLACE_COMPRESS_TABLE();
          5) commits
          6) exits normally
          7) Calls a separate program to check the tables.

          Looking back on this now I wonder, should everything be ok if the exit came mid-checkpoint? I think I asked this once before and the answer was yes, but I forget the why of it.

          Show
          Kathey Marsden added a comment - I kicked off a run with trunk and IBM 1.6 SR9. I will let it run over the weekend and see if it pops the issue. Booting Derby version The Apache Software Foundation - Apache Derby - 10.9.0.0 a lpha - (1097280): instance a816c00e-012f-eac3-8c01-0000001c6ff0 on database directory C:\cygwin\home\kmarsden\repro\derby-4252\wombat with clas s loader sun.misc.Launcher$AppClassLoader@4b304b30 Loaded from file:/C:/cygwin/svn4/trunka/jars/sane/derby.jar java.vendor=IBM Corporation java.runtime.version=jvmwi3260sr9-20110203_74623 java.fullversion=JRE 1.6.0 IBM J9 2.4 Windows XP x86-32 jvmwi3260sr9-20110203_74 623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 Regarding what you see in the log files with the clustered compresses at the end, it would be interesting to see if that was the same for the passing runs. The test basically. 1) Launches a thread that loops doing checkpoints with no pause between. 2) Loads a table with 5000 rows. 3) Deletes all the rows with a single sql statement delete from 4) calls SYSCS_UTIL.SYSCS_INPLACE_COMPRESS_TABLE(); 5) commits 6) exits normally 7) Calls a separate program to check the tables. Looking back on this now I wonder, should everything be ok if the exit came mid-checkpoint? I think I asked this once before and the answer was yes, but I forget the why of it.
          Hide
          Kathey Marsden added a comment -

          Thank you Rick for looking at this issue. I will give the reproduction a try with the latest IBM JVM release later this week and post my results.

          Show
          Kathey Marsden added a comment - Thank you Rick for looking at this issue. I will give the reproduction a try with the latest IBM JVM release later this week and post my results.
          Hide
          Rick Hillegas added a comment -

          Hi Kathey,

          It has been almost 2 years since your last comment on this issue. Are you still seeing this problem on more recent versions of your VM?

          Thanks-Rick

          Show
          Rick Hillegas added a comment - Hi Kathey, It has been almost 2 years since your last comment on this issue. Are you still seeing this problem on more recent versions of your VM? Thanks-Rick
          Hide
          Rick Hillegas added a comment -

          I have run the log-file-inspecting tools against the files in the log directory of the bad database attached to this issue. The log-file-inspecting tools can be found on DERBY-5195. I can report the following facts about this data corruption:

          1) The control files log.ctrl and logmirror.ctrl appear to be intact and identical.

          2) There are 20 log files in the directory. The first 19 seem to be valid. At least, the LogFileReader is able to successfully deserialize all of them.

          3) The last log file is corrupt. It is only 17 bytes long so it doesn't even have a valid header record. I get an EOF reading this file. The initial formatableID, obsoleteVersion, and logFileNumber fields are correct but the previousRecordLogInstant is truncated. This is what I see:

          java LogFileReader wombat/log/log20.dat -n 0
          <?xml version="1.0" encoding="UTF-8" standalone="no"?>
          <logFile>
          <formatableID>128</formatableID>
          <obsoleteVersion>9</obsoleteVersion>
          <logFileNumber>20</logFileNumber>
          <message>java.io.EOFException: null</message>
          <stackTrace>java.io.EOFException
          at java.io.DataInputStream.readFully(DataInputStream.java:180)
          at java.io.DataInputStream.readLong(DataInputStream.java:399)
          at LogFileReader$LogFile.readFileHeader(LogFileReader.java:250)
          at LogFileReader$LogFile.readFile(LogFileReader.java:216)
          at LogFileReader$LogFile.printMe(LogFileReader.java:199)
          at LogFileReader.main(LogFileReader.java:88)
          </stackTrace>
          </logFile>

          4) I have noticed the following oddity, which may be a red herring: The last good log file (log19.dat) ends with a flurry of CheckpointOperations. Each CheckpointOperation is separated from the preceding CheckpointOperation by only a couple hundred bytes. In the other log files, the spacing of CheckpointOperations is generally several tens of thousands of bytes.

          I would be happy to continue exploring these files if someone would like to suggest what I should look for. Alternatively, the LogFileReader is very easy to use and would be even more powerful in the hands of a real Store expert.

          Show
          Rick Hillegas added a comment - I have run the log-file-inspecting tools against the files in the log directory of the bad database attached to this issue. The log-file-inspecting tools can be found on DERBY-5195 . I can report the following facts about this data corruption: 1) The control files log.ctrl and logmirror.ctrl appear to be intact and identical. 2) There are 20 log files in the directory. The first 19 seem to be valid. At least, the LogFileReader is able to successfully deserialize all of them. 3) The last log file is corrupt. It is only 17 bytes long so it doesn't even have a valid header record. I get an EOF reading this file. The initial formatableID, obsoleteVersion, and logFileNumber fields are correct but the previousRecordLogInstant is truncated. This is what I see: java LogFileReader wombat/log/log20.dat -n 0 <?xml version="1.0" encoding="UTF-8" standalone="no"?> <logFile> <formatableID>128</formatableID> <obsoleteVersion>9</obsoleteVersion> <logFileNumber>20</logFileNumber> <message>java.io.EOFException: null</message> <stackTrace>java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readLong(DataInputStream.java:399) at LogFileReader$LogFile.readFileHeader(LogFileReader.java:250) at LogFileReader$LogFile.readFile(LogFileReader.java:216) at LogFileReader$LogFile.printMe(LogFileReader.java:199) at LogFileReader.main(LogFileReader.java:88) </stackTrace> </logFile> 4) I have noticed the following oddity, which may be a red herring: The last good log file (log19.dat) ends with a flurry of CheckpointOperations. Each CheckpointOperation is separated from the preceding CheckpointOperation by only a couple hundred bytes. In the other log files, the spacing of CheckpointOperations is generally several tens of thousands of bytes. I would be happy to continue exploring these files if someone would like to suggest what I should look for. Alternatively, the LogFileReader is very easy to use and would be even more powerful in the hands of a real Store expert.
          Kathey Marsden made changes -
          Labels derby_triage10_5_2
          Dag H. Wanvik made changes -
          Urgency Normal
          Issue & fix info [Repro attached]
          Hide
          Dag H. Wanvik added a comment -

          Triaged for 10.5.2, checking "repro attached" and setting "normal" urgency.

          Show
          Dag H. Wanvik added a comment - Triaged for 10.5.2, checking "repro attached" and setting "normal" urgency.
          Hide
          Kathey Marsden added a comment -

          The frequency declined significantly after the fix for DERBY-4239, before that fix it would occur after about 12 iterations with IBM 1.6instead of 200. Also when I came up with the Windows repro for DERBY-4239, I only saw the DERBY-4239 error on the Sun JVM. On IBM 1.6 I would hit this EOFException first. See:

          https://issues.apache.org/jira/browse/DERBY-4239?focusedCommentId=12712720&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12712720

          Thank you for looking at this issue. When I asked Mike whether I should try to get a more reliable reproduction, he suggested we first analyze the log from the corrupted database. Mike posted some tips for doing this at: https://issues.apache.org/jira/browse/DERBY-4239?focusedCommentId=12711832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12711832

          Perhaps he has some more specific suggestions as to what to look for here.

          Show
          Kathey Marsden added a comment - The frequency declined significantly after the fix for DERBY-4239 , before that fix it would occur after about 12 iterations with IBM 1.6instead of 200. Also when I came up with the Windows repro for DERBY-4239 , I only saw the DERBY-4239 error on the Sun JVM. On IBM 1.6 I would hit this EOFException first. See: https://issues.apache.org/jira/browse/DERBY-4239?focusedCommentId=12712720&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12712720 Thank you for looking at this issue. When I asked Mike whether I should try to get a more reliable reproduction, he suggested we first analyze the log from the corrupted database. Mike posted some tips for doing this at: https://issues.apache.org/jira/browse/DERBY-4239?focusedCommentId=12711832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12711832 Perhaps he has some more specific suggestions as to what to look for here.
          Hide
          Rick Hillegas added a comment -

          Hi Kathey,

          I wanted to clarify the symptoms you recorded above in the Description section. It's unclear to me whether the frequency of this stack trace declined after the patch for DERBY-4239 or whether just the total number of failures declined because you longer saw the stack trace from DERBY-4239. Thanks.

          Show
          Rick Hillegas added a comment - Hi Kathey, I wanted to clarify the symptoms you recorded above in the Description section. It's unclear to me whether the frequency of this stack trace declined after the patch for DERBY-4239 or whether just the total number of failures declined because you longer saw the stack trace from DERBY-4239 . Thanks.
          Hide
          Rick Hillegas added a comment -

          I have failed to reproduce this bug after running the attached test for 800 iterations on Ubuntu using the Sun Java 6 VM. Except for the VM, my environment was identical to the environment of my previous test:

          Ubuntu 7.10 running onVMWare Fusion 1.1.1 on
          Mac OS X 10.5.7
          2.4 GHz Intel Core 2 Duo

          java version "1.6.0_14"
          Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
          Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode, sharing)

          Show
          Rick Hillegas added a comment - I have failed to reproduce this bug after running the attached test for 800 iterations on Ubuntu using the Sun Java 6 VM. Except for the VM, my environment was identical to the environment of my previous test: Ubuntu 7.10 running onVMWare Fusion 1.1.1 on Mac OS X 10.5.7 2.4 GHz Intel Core 2 Duo java version "1.6.0_14" Java(TM) SE Runtime Environment (build 1.6.0_14-b08) Java HotSpot(TM) Client VM (build 14.0-b16, mixed mode, sharing)
          Hide
          Knut Anders Hatlen added a comment -

          I tried the repro (50 iterations) on an 8-way system and on a 32-way system with Solaris 10 and JDK 6u14, but I wasn't able to reproduce it. (Not that the repro has that many threads, but I thought it was worth a try.)

          Show
          Knut Anders Hatlen added a comment - I tried the repro (50 iterations) on an 8-way system and on a 32-way system with Solaris 10 and JDK 6u14, but I wasn't able to reproduce it. (Not that the repro has that many threads, but I thought it was worth a try.)
          Hide
          Kathey Marsden added a comment -

          I did try with the Sun 1.6 JVM on Windows
          java version "1.6.0_01"
          Java(TM) SE Runtime Environment (build 1.6.0_01-b06)
          Java HotSpot(TM) Client VM (build 1.6.0_01-b06, mixed mode)

          and could not reproduce.

          Show
          Kathey Marsden added a comment - I did try with the Sun 1.6 JVM on Windows java version "1.6.0_01" Java(TM) SE Runtime Environment (build 1.6.0_01-b06) Java HotSpot(TM) Client VM (build 1.6.0_01-b06, mixed mode) and could not reproduce.
          Hide
          Rick Hillegas added a comment -

          Thanks, Kathey. Did you try this on the Sun 1.6 VM? If not, I will try running the test on Ubuntu using that VM. Thanks.

          Show
          Rick Hillegas added a comment - Thanks, Kathey. Did you try this on the Sun 1.6 VM? If not, I will try running the test on Ubuntu using that VM. Thanks.
          Hide
          Kathey Marsden added a comment -

          On Windows as well I was not able to reproduce with the Sun JVM, even before the fix for DERBY-4239 when this was much easier to recreate. I have seen it only with IBM 1.6. I don't think I tried with IBM 1.5.

          Even though we have seen it only on one JVM, I would not say for sure that is a JVM bug until there is more investigation. In the past we have found some Derby issues easier to pop or exacerbated with IBM 1.6 because of increased concurrency.

          Show
          Kathey Marsden added a comment - On Windows as well I was not able to reproduce with the Sun JVM, even before the fix for DERBY-4239 when this was much easier to recreate. I have seen it only with IBM 1.6. I don't think I tried with IBM 1.5. Even though we have seen it only on one JVM, I would not say for sure that is a JVM bug until there is more investigation. In the past we have found some Derby issues easier to pop or exacerbated with IBM 1.6 because of increased concurrency.
          Hide
          Rick Hillegas added a comment -

          I have failed to reproduce this bug after running the attached test for 800 iterations on Ubuntu using the Sun Java 5 VM. My environment is:

          Ubuntu 7.10 running onVMWare Fusion 1.1.1 on
          Mac OS X 10.5.7
          2.4 GHz Intel Core 2 Duo

          java version "1.5.0_15"
          Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
          Java HotSpot(TM) Client VM (build 1.5.0_15-b04, mixed mode, sharing)

          Show
          Rick Hillegas added a comment - I have failed to reproduce this bug after running the attached test for 800 iterations on Ubuntu using the Sun Java 5 VM. My environment is: Ubuntu 7.10 running onVMWare Fusion 1.1.1 on Mac OS X 10.5.7 2.4 GHz Intel Core 2 Duo java version "1.5.0_15" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04) Java HotSpot(TM) Client VM (build 1.5.0_15-b04, mixed mode, sharing)
          Hide
          Rick Hillegas added a comment -

          I have failed to reproduce this bug after running the attached test for 800 iterations on my Mac, using the Apple Java 5 VM. My environment is:

          Mac OS X 10.5.7
          2.4 GHz Intel Core 2 Duo

          java version "1.5.0_19"
          Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
          Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)

          Show
          Rick Hillegas added a comment - I have failed to reproduce this bug after running the attached test for 800 iterations on my Mac, using the Apple Java 5 VM. My environment is: Mac OS X 10.5.7 2.4 GHz Intel Core 2 Duo java version "1.5.0_19" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304) Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)
          Kathey Marsden made changes -
          Attachment reproBackgroundCheckpoint.zip [ 12409824 ]
          Hide
          Kathey Marsden added a comment -

          The reproduction for DERBY-4239 exposed this issue.

          Show
          Kathey Marsden added a comment - The reproduction for DERBY-4239 exposed this issue.
          Kathey Marsden made changes -
          Link This issue is related to DERBY-4239 [ DERBY-4239 ]
          Kathey Marsden made changes -
          Field Original Value New Value
          Attachment corrupt_database_with_logs.zip [ 12409663 ]
          Hide
          Kathey Marsden added a comment -

          Attached is corrupt_database_with_logs.zip with the database and derby.log when the corruption occurs.
          derby.properties contained:
          derby.storage.keepTransactionLog=true
          derby.infolog.append=true

          Show
          Kathey Marsden added a comment - Attached is corrupt_database_with_logs.zip with the database and derby.log when the corruption occurs. derby.properties contained: derby.storage.keepTransactionLog=true derby.infolog.append=true
          Kathey Marsden created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Kathey Marsden
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development