DERBY-241

Encrypted run of stress.multi test failed once with a boot error with ibm142


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.1.1.0
    • Fix Version/s: None
    • Component/s: Store
    • Labels: None
    • Environment: ibm142; the machine is a Dell, 1 CPU, 256MB RAM, ~497MHz, with an IDE disk and write cache enabled.

    Description

      The stress.multi test failed once in the encryption run with ibm142 while running the derbyall suite on the machine described below. I have not been able to reproduce the failure since then.

      The machine on which it failed is a Dell with 1 CPU, 256MB RAM, ~497MHz, an IDE disk, and write cache enabled. As far as I can tell, the machine was up and running OK while the tests were running.

      In the test directory for the stress.multi test, derby.log shows a lot of interrupts, and the errors include the following boot error.

      Booting Derby version The Apache Software Foundation - Apache Derby - 10.1.0.0 alpha - (31132): instance c013800d-0103-64b3-44ec-ffffa1f4cf33
      on database directory E:\classtest\JarResults.2005-04-20\ibm142_derbyall\derbyall\encryptionAll\encryption\multi\stress\mydb
      ERROR XSLA7: Cannot redo operation Page Operation: Page(5,Container(0, 384)) pageVersion 3 : Insert : Slot=2 recordId=8 in the log.

      Here are some of my notes from trying to debug this:

      0) Copied the problematic database to a safe location and used sane jars for debugging.
      1) Tried to boot the database using ij with the debug property derby.debug.true=DumpLogOnly set; this dumped all the log records into derby.log. Searching the dump for log records for Container(0, 384) found only 3 records pertaining to it.

      There is one record for create container and 2 records for insert:
      Space Operation for create container (0, 384)
      Page operation for (Page 5, Container(0, 384)), version 3, involving an insert at slot 2, record 8.
      Page operation for version 4, involving an insert at slot 3, record 9.

      => There was no initPage operation for this page, nor any records pertaining to page versions 1 and 2. This means that log records were missing. The only case in which this would be OK is a system catalog table: during create database the data pages themselves are flushed to disk, so having no log records for them is expected.
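
The check described in step 1 can be sketched as a small scan over the dumped log text. This is only a sketch: it assumes that page operations in the DumpLogOnly output look like the text of the XSLA7 message quoted above, and the class and method names are hypothetical. The second sample line is reconstructed from the notes above (version 4, slot 3, record 9), not copied from the actual dump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Derby.
class PageVersionScan {
    // Assumed record format, taken from the XSLA7 message text in this report.
    private static final Pattern PAGE_OP = Pattern.compile(
        "Page\\((\\d+),\\s*Container\\((\\d+),\\s*(\\d+)\\)\\)\\s+pageVersion\\s+(\\d+)");

    // Returns the page versions logged for one page, in the order they appear.
    static List<Long> versionsFor(List<String> lines, long page, long segment, long container) {
        List<Long> versions = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAGE_OP.matcher(line);
            while (m.find()) {
                boolean samePage = Long.parseLong(m.group(1)) == page
                    && Long.parseLong(m.group(2)) == segment
                    && Long.parseLong(m.group(3)) == container;
                if (samePage) {
                    versions.add(Long.parseLong(m.group(4)));
                }
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Page Operation: Page(5,Container(0, 384)) pageVersion 3 : Insert : Slot=2 recordId=8",
            "Page Operation: Page(5,Container(0, 384)) pageVersion 4 : Insert : Slot=3 recordId=9");
        // Versions 1 and 2 are absent, which is the gap noted above.
        System.out.println("versions seen: " + versionsFor(sample, 5, 0, 384));
    }
}
```

If the lowest version found is greater than 1 and the dump contains no initPage record for the page, the earlier changes to that page were never logged.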

      2) Next step: tried to verify whether it was a system catalog table.
      Looking at getNextConglomId() in org.apache.derby.impl.store.access.RAMAccessManager, the container key 384 (0x180 in hex) maps to the 18th id.
      One way I verified this was to create another empty database and check whether the file c180.dat existed in it. It did, which confirms that this is a system catalog table.
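
The mapping from container id 384 to c180.dat is plain decimal-to-hex conversion. The sketch below assumes that container files in seg0 are named "c" followed by the container id in hex and ".dat", which matches the 384 and c180.dat pair observed here; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Derby.
class ContainerFileName {
    // Assumption: file name is "c" + hex(container id) + ".dat".
    static String forContainerId(long containerId) {
        return "c" + Long.toHexString(containerId) + ".dat";
    }

    public static void main(String[] args) {
        System.out.println(forContainerId(384)); // prints c180.dat
    }
}
```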

      3) To find the actual cause of the redo exception, I added stack trace printing to the code and ran it under the debugger. The error printed was
      ERROR XSDB1: Unknown page format at page Page(5,Container(0, 384))
      It seemed that the page format was corrupted. I added printlns to get the page format id (in CachedPage, setIdentity) and tried to dump the contents of the page.
      Checksum validation would have happened had the format id been OK, but since the format id was corrupted, this error is thrown instead of a checksum error.

      4) MKS has an od facility that dumps file contents in hex and character format. This table maps to the 18th id, which is c180.dat in the seg0 directory. Dumping it with od -c c180.dat shows output like this:
      S Y S C S _ B A C
      0000034040 K U P _ D A T A B A S E _ A N D
      0000034060 _ E N A B L E _ L O G _ A R C H
      0000034100 I V E _ M O D E \

      These appear to be system catalog procedure names, and it seems odd that they are not encrypted.
      Need to verify whether system catalogs are encrypted. If they are, this is probably an interrupt problem with encryption.
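
The od inspection in step 4 can also be done programmatically by searching the container file for a known catalog name as raw ASCII bytes. This is a minimal sketch with hypothetical names; it assumes the name is stored one byte per character, as the od -c output above suggests, and a hit only shows that the bytes are readable, not why.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Derby.
class PlaintextProbe {
    // Returns the offset of the first occurrence of needle in data, or -1.
    static int indexOf(byte[] data, String needle) {
        byte[] n = needle.getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + n.length <= data.length; i++) {
            for (int j = 0; j < n.length; j++) {
                if (data[i + j] != n[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java PlaintextProbe <path to container file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int at = indexOf(data, "SYSCS_BACKUP_DATABASE");
        System.out.println(at >= 0
            ? "plaintext found at byte offset " + at
            : "no plaintext match");
    }
}
```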

      Attachments

        1. pageDataHexDump.txt (17 kB, Sunitha Kambhampati)
        2. od_c_c180.txt (83 kB, Sunitha Kambhampati)
        3. encryption_multi.zip (745 kB, Sunitha Kambhampati)


          People

            Assignee: Unassigned
            Reporter: Sunitha Kambhampati (skambha)
            Votes: 0
            Watchers: 0
