
Key: DERBY-3611
Type: Bug
Status: Resolved
Resolution: Duplicate
Priority: Critical
Assignee: Unassigned
Reporter: David Sitsky
Votes: 0
Watchers: 1
Derby

ERROR XSDG2: Invalid checksum on Page occurs during mass inserts into two-column bigint PK table

Created: 10/Apr/08 01:24 AM   Updated: 30/Apr/08 08:16 PM
Component/s: Store
Affects Version/s: 10.3.1.4, 10.3.2.1
Fix Version/s: 10.3.3.0, 10.4.1.3

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
d3347-1a+2a.diff    19 kB    2008-04-10 02:38 PM    Knut Anders Hatlen
derby-worker0.log   69 kB    2008-04-10 01:25 AM    David Sitsky
derby-worker3.log   183 kB   2008-04-10 01:26 AM    David Sitsky
derby-worker4.log   59 kB    2008-04-10 01:26 AM    David Sitsky
Environment: Occurred on 6 separate quad-core machines running Vista, Vista SP1 or Server 2008. Also seen on an AMD64 dual-core 4200 with 4 GB of RAM running 32-bit XP Pro.

Resolution Date: 30/Apr/08 08:16 PM


Description:
The original, extensive email thread reporting this issue can be found here: http://www.nabble.com/ERROR-XSDG2%3A-Invalid-checksum-on-Page-Page%280%2CContainer%280%2C-1313%29%29-td16389697.html

I have an intensive data-processing application which uses Apache Derby, running on 6 quad-core machines with Vista SP1 and/or Server 2008. Each quad-core machine typically runs 4 separate JVM worker processes, each running its own embedded Derby database.

I have found that after 5 to 10 hours of processing, one or two of my worker processes start reporting the following error in their derby.log file:

ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))

The worker process never seems to recover. Derby detects the error and reboots the database, but inevitably reports the same error again. I have tried both 10.3.1.4 and 10.3.2.1 with the same results. The conglomerate and page number are always the same.

I am confident it is not a hardware issue: it has occurred across 6 separate machines, with both software and hardware RAID, and no disk errors have been reported. A customer of our software also reported this error occurring on their AMD64 dual-core 4200 with 4 GB of RAM running 32-bit XP Pro.

The table the conglomerate refers to is as follows:

CREATE TABLE text_table (guidhigh BIGINT NOT NULL,
                         guid BIGINT NOT NULL,
                         data BLOB (1G) NOT NULL,
                         PRIMARY KEY (guidhigh, guid))

In this application, essentially random values were generated for guidhigh and guid, and data held compressed text ranging in size from a few bytes to many megabytes.

Within a single transaction, the processing code did a select from the table on guidhigh and guid to check whether an entry existed, and inserted a new row if it did not.

If I forcibly shut the application down, I could connect to the database using ij, and would get the same error:

ij> select count(*) from text_table;
ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313)), expected=304,608,373, on-disk version=2,462,088,751, page dump follows: Hex dump:
00000000: 0076 0000 0001 0000 0000 0000 27ea 0000 .v..............
00000010: 0000 0006 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0001 0000 0000 0000 0000 0000 ................
....

A workaround, suggested on derby-user by Stanley Bradbury, was to not have the PK in place during the load, and we managed to implement this in our application. We also replaced the two-column PK with a single column, and the problem has not occurred since.
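
For readers who want to try the first part of that workaround, the following is a minimal sketch of one way it could be done, not the code used in the application. It assumes the primary key is added with ALTER TABLE once the load has finished; the class name DeferredPrimaryKey and the method names are hypothetical, and the bulk-load step is left to the caller.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: create the table without its primary key,
// load the data, then add the key afterwards.
class DeferredPrimaryKey {
    // Same columns as text_table in this report, minus the PRIMARY KEY clause.
    static final String CREATE_WITHOUT_PK =
        "CREATE TABLE text_table (guidhigh BIGINT NOT NULL, "
        + "guid BIGINT NOT NULL, "
        + "data BLOB (1G) NOT NULL)";

    // Assumption: the key is restored with ALTER TABLE after the load.
    static final String ADD_PK =
        "ALTER TABLE text_table ADD PRIMARY KEY (guidhigh, guid)";

    // Executes a single DDL statement and commits it.
    static void runDdl(Connection conn, String ddl) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(ddl);
        }
        if (!conn.getAutoCommit()) {
            conn.commit();
        }
    }

    // Call before the bulk load starts.
    static void beforeLoad(Connection conn) throws SQLException {
        runDdl(conn, CREATE_WITHOUT_PK);
    }

    // Call after the bulk load has finished; this fails if the load
    // produced duplicate (guidhigh, guid) pairs.
    static void afterLoad(Connection conn) throws SQLException {
        runDdl(conn, ADD_PK);
    }
}
```

Note that without the key in place, the select that checks for an existing row has no index to use during the load, so duplicates have to be avoided or detected some other way.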

I'll attach a number of example derby.log files which contain the error messages.

Comments:
David Sitsky added a comment - 10/Apr/08 01:25 AM
Example derby.log file running from quad-core machine 0.

David Sitsky added a comment - 10/Apr/08 01:26 AM
Example derby.log file running from quad-core machines 3 and 4.

Dyre Tjeldvoll added a comment - 10/Apr/08 09:15 AM
I wonder if this is related to DERBY-3347? It is not the same error message, but the symptoms appear to be similar.

Knut Anders Hatlen added a comment - 10/Apr/08 11:48 AM
Some of the problems found in DERBY-3347 could lead to corruption on page 0, so yes, it might be related.

Knut Anders Hatlen added a comment - 10/Apr/08 02:38 PM
Hi David,

It sounds like the problem is reliably reproducible in your environment. I have attached two patches to DERBY-3347 that may affect this bug. (I have merged the patches into a single patch which I have attached here as well.) Is there any chance that you could build Derby with the patch applied and see if it solves your problems? Thanks.

David Sitsky added a comment - 11/Apr/08 03:48 AM
Hi Knut,

Something I didn't write in the description of this bug record, but did write in my email, is that unfortunately it is not easy to reproduce. In a 24-hour run with 22 individual JVM processes running across 6 quad-core machines (one machine runs only two processes), sometimes they would all run successfully, and sometimes 1 or 2 processes would trigger the condition after many hours.

Given that they are all independent processes, you can see it isn't that easy to reproduce - 22 days of "serial processing time" may trigger the condition.

We have quite a lot of customers, but so far, only one has reported this issue to us.

Our 6 quad-core system is heavily used, so I may not be able to access it soon to run this test. I will try to do so next week, time permitting, but I can't promise anything.

I would actually recommend writing a small program with the table described in the bug report, and just an endless loop where you create random numbers for the two guid columns, with some random binary data for the other. We just had a transaction which did a select on the two guid columns, and if the row didn't exist, did the insert, then the commit. This is basically what our application does.
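
The loop described above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the application's code and not a confirmed reproduction: the class name TextTableStress, the database name stressdb, and the 1 MiB payload cap are all made up for the example. It assumes the embedded Derby driver is on the classpath and that the database is fresh, since CREATE TABLE fails if text_table already exists.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

// Hypothetical stress loop for the table described in this report.
class TextTableStress {
    // Table definition copied from the bug description.
    static final String CREATE_TABLE =
        "CREATE TABLE text_table (guidhigh BIGINT NOT NULL, "
        + "guid BIGINT NOT NULL, "
        + "data BLOB (1G) NOT NULL, "
        + "PRIMARY KEY (guidhigh, guid))";

    // Random binary payload of 1..maxBytes bytes (the size cap is an assumption).
    static byte[] randomPayload(Random rnd, int maxBytes) {
        byte[] data = new byte[1 + rnd.nextInt(maxBytes)];
        rnd.nextBytes(data);
        return data;
    }

    // One transaction: select on both guid columns, insert if absent, commit.
    static boolean insertIfAbsent(Connection conn, long guidHigh, long guid,
                                  byte[] data) throws SQLException {
        boolean inserted = false;
        try (PreparedStatement sel = conn.prepareStatement(
                "SELECT 1 FROM text_table WHERE guidhigh = ? AND guid = ?")) {
            sel.setLong(1, guidHigh);
            sel.setLong(2, guid);
            try (ResultSet rs = sel.executeQuery()) {
                if (!rs.next()) {
                    try (PreparedStatement ins = conn.prepareStatement(
                            "INSERT INTO text_table (guidhigh, guid, data) "
                            + "VALUES (?, ?, ?)")) {
                        ins.setLong(1, guidHigh);
                        ins.setLong(2, guid);
                        ins.setBytes(3, data);
                        ins.executeUpdate();
                        inserted = true;
                    }
                }
            }
        }
        conn.commit();
        return inserted;
    }

    public static void main(String[] args) throws SQLException {
        // Assumed embedded URL; pass a different JDBC URL as the first argument.
        String url = args.length > 0 ? args[0] : "jdbc:derby:stressdb;create=true";
        Random rnd = new Random();
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.executeUpdate(CREATE_TABLE);
            }
            conn.commit();
            // Endless loop: runs until the process is killed or an error is thrown.
            while (true) {
                insertIfAbsent(conn, rnd.nextLong(), rnd.nextLong(),
                               randomPayload(rnd, 1 << 20));
            }
        }
    }
}
```

Given how rarely the failure appeared in the runs described here, several such processes would probably need to run for many hours before anything shows up in derby.log.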



Knut Anders Hatlen added a comment - 11/Apr/08 01:26 PM
OK, thanks, I'll try to do that.

David Sitsky added a comment - 23/Apr/08 03:52 AM
Hi Knut,

I now have access to my quad-core farm again. I have downloaded 10.4.1.2 and will hit my machines hard over the next few days with an older version of our software which had the issue reported in this bug record.

If we don't see the corruption issue by then, we can probably safely assume it has been fixed. I'll let you know how things go.

Cheers,
David

David Sitsky added a comment - 28/Apr/08 12:42 AM
I have been running the software with 10.4.1.2 for almost 5 days now with no database corruption, so I would say with reasonable certainty that the issue has been fixed - many thanks! Looking forward to the imminent major release.

Cheers,
David

Mike Matrigali added a comment - 30/Apr/08 08:16 PM
We believe this issue was one of the many ways DERBY-3347 could show itself. Stress testing has been performed, and it looks like the 10.4 release fixes the problem. Closing this bug. Please reopen it or file a new issue if you manage to reproduce this error starting with a fresh database, using software that includes the fix for DERBY-3347.