
Key: DERBY-96
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Suresh Thalamati
Reporter: Suresh Thalamati
Votes: 0
Watchers: 0
Derby

Partial log record writes that occur because of out-of-order writes need to be handled by recovery.

Created: 09/Dec/04 08:11 PM   Updated: 01/Jul/09 12:34 AM
Component/s: Store
Affects Version/s: 10.0.2.1
Fix Version/s: 10.1.1.0

Time Tracking:
Not Specified

Resolution Date: 28/May/05 09:07 AM


Description
An incomplete log record write caused by out-of-order partial writes
is recognized as complete during recovery if the first and last
sectors happen to have been written.
The current system detects incompletely written log records by checking
the record length, which is stored at both the beginning and the end of the record.
The format in which log records are written to disk is:

  +--------+------------+--------+
  | length | LOG RECORD | length |
  +--------+------------+--------+


This mechanism works if sectors are written sequentially or if the
log record is smaller than two sectors. I believe that on SCSI-type disks
the write order is not necessarily sequential; SCSI disk drives may
sometimes reorder sector writes to optimize performance. If a log record
that spans multiple disk sectors is being written to a SCSI-type
device, it is possible that only the first and last sectors are written
before a crash. If this occurs, the recovery system will incorrectly
conclude that the log record was completely written and will replay it.
This could lead to recovery errors or data corruption.
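
The failure mode can be reproduced in a few lines. The following is only an illustrative sketch, not Derby code: the class name, the 512-byte sector size, the 4-byte length fields, and the assumption that the record starts on a sector boundary are all made up for the example. It frames a payload as length | data | length, simulates a crash in which only the first and last sectors reached the disk, and shows that the length-only check still accepts the record.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TornLogRecordDemo {
    // Assumed sector size for the simulation.
    static final int SECTOR = 512;

    // Lays out one record as: int length | payload | int length.
    public static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
        buf.putInt(payload.length).put(payload).putInt(payload.length);
        return buf.array();
    }

    // The length-only completeness test: leading and trailing lengths must agree.
    public static boolean looksComplete(byte[] onDisk) {
        if (onDisk.length < 8) return false;
        ByteBuffer buf = ByteBuffer.wrap(onDisk);
        int head = buf.getInt(0);
        if (head < 0 || head > onDisk.length - 8) return false;
        int tail = buf.getInt(4 + head);
        return head == tail;
    }

    // Simulates a crash after only the first and last sectors were written.
    // The rest of the (previously zeroed) disk area keeps its old contents.
    public static byte[] tornWrite(byte[] framed) {
        byte[] disk = new byte[framed.length];
        int lastSectorStart = ((framed.length - 1) / SECTOR) * SECTOR;
        System.arraycopy(framed, 0, disk, 0, Math.min(SECTOR, framed.length));
        System.arraycopy(framed, lastSectorStart, disk, lastSectorStart,
                framed.length - lastSectorStart);
        return disk;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[2000]; // spans four sectors once framed
        Arrays.fill(payload, (byte) 7);
        byte[] framed = frame(payload);
        byte[] disk = tornWrite(framed);
        System.out.println("looks complete: " + looksComplete(disk));
        System.out.println("contents intact: " + Arrays.equals(framed, disk));
    }
}
```

With a 2000-byte payload both length fields land in sectors that were written, so the check passes even though the two middle sectors never reached the disk.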


This problem also will not occur if the disk drive has a write cache with a
battery backup, which ensures that the I/O request completes.

Suresh Thalamati added a comment - 09/Dec/04 11:35 PM
Some thoughts on how this problem could be solved:

To identify partial writes, some form of checksum has to be added to the log data written to the file. During recovery, the checksum information can be used to identify partially written log records and throw them away. The checksum has to be included with the log data before it is written to disk. The question is when to calculate the checksum and write it to disk.

The following are some logical points at which the checksum can be calculated and written along with the log information:

1) Calculate the checksum for each log record and store it as part of the log record data structure. The disadvantage of this approach is that storing a checksum with each log record could be expensive in both the space used and the time spent calculating it.

2) Calculate a checksum for a group of log records in the log buffer before writing the buffer to disk, and also write an additional log record that holds the checksum information and the length of the data. This log record (LogCheckSum) will be prefixed to the log buffer. The checksum log record is written at the beginning because that makes it easier to find out how much data has to be read during recovery to verify the checksum.

    Log data is written only when a log buffer is full or when a write is needed to make sure the WAL protocol is not violated. The size of the data covered by one checksum can potentially be 32K, or whatever the log buffer size is. The overhead of this approach is lower than that of the first approach.
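
A rough sketch of option 2, under stated assumptions: the header layout (a 4-byte data length followed by an 8-byte checksum value), the choice of java.util.zip.CRC32, and all names are invented for illustration and are not the proposed on-disk format. It shows why a prefixed checksum record is convenient: recovery reads the header first, learns how many bytes to read, and then verifies them.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ChecksumPrefixSketch {
    // Hypothetical header: int dataLength | long checksumValue.
    public static final int HEADER = 4 + 8;

    // Prefixes a group of log records (already serialized) with its checksum header.
    public static byte[] writeGroup(byte[] logData) {
        CRC32 crc = new CRC32();
        crc.update(logData, 0, logData.length);
        ByteBuffer out = ByteBuffer.allocate(HEADER + logData.length);
        out.putInt(logData.length).putLong(crc.getValue()).put(logData);
        return out.array();
    }

    // Recovery side: read the header, then verify exactly dataLength bytes.
    public static boolean verifyGroup(byte[] onDisk) {
        if (onDisk.length < HEADER) return false;
        ByteBuffer in = ByteBuffer.wrap(onDisk);
        int len = in.getInt();
        long expected = in.getLong();
        if (len < 0 || len > in.remaining()) return false;
        CRC32 crc = new CRC32();
        crc.update(onDisk, HEADER, len);
        return crc.getValue() == expected;
    }

    public static void main(String[] args) {
        byte[] group = writeGroup(new byte[32 * 1024]);
        System.out.println("intact group verifies: " + verifyGroup(group));
        group[HEADER + 100] ^= 0x01; // damage one bit in the covered data
        System.out.println("damaged group verifies: " + verifyGroup(group));
    }
}
```

A group that fails verification would be treated as the end of the usable log, in the same way that a mismatched length is treated today.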


3) Block-based log I/O: The idea is to group log record data into 4K/8K pages with a checksum on each page. During recovery, the checksum will be recalculated for each page and matched against the one on disk; if the checksums do not match, the page was possibly a partial write.

This approach is likely to have more overhead than the second one, but it also has the benefit of making log writes aligned. I am not sure yet whether there is any performance benefit from doing so. (Please see the aligned vs. non-aligned e-mail thread on the derby list.)

I should also point out that this approach will likely require more changes than 1 and 2, for these reasons:

a) The current system assumes that an LSN maps directly to a file offset. If the data is written in page format, that will no longer be true.
b) To stick to the WAL protocol, it may be necessary to write an unfilled page. If this unfilled page happens to contain a COMMITTED log record, it cannot simply be rewritten; if the rewrite is incomplete, log records with committed information will be thrown away. To avoid this issue, either partially filled log pages cannot be rewritten, which could lead to unused space in the log file, or a safe-write mechanism (ping-pong algorithm) has to be implemented.
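
To make point (a) concrete, here is a small sketch of the extra translation a paged log would need. The 4096-byte page and the 8-byte per-page checksum header are assumed values chosen for the example, and the class and method names are hypothetical. Once every page carries a header, a logical log position no longer equals its file offset.

```java
public class PagedLogOffsetSketch {
    // Assumed layout: each 4096-byte page starts with an 8-byte checksum header.
    static final int PAGE_SIZE = 4096;
    static final int PAGE_HEADER = 8;
    static final int PAYLOAD = PAGE_SIZE - PAGE_HEADER;

    // Maps a position in the logical log stream to a byte offset in the file.
    public static long physicalOffset(long logicalOffset) {
        long page = logicalOffset / PAYLOAD;
        long within = logicalOffset % PAYLOAD;
        return page * PAGE_SIZE + PAGE_HEADER + within;
    }

    public static void main(String[] args) {
        System.out.println(physicalOffset(0));    // 8
        System.out.println(physicalOffset(4087)); // 4095, last byte of page 0
        System.out.println(physicalOffset(4088)); // 4104, first payload byte of page 1
    }
}
```

With options 1 and 2 no such mapping is needed, because the log stays a plain byte stream.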


Upgrade:
Irrespective of which approach is used to solve this problem, I believe a new type of information (the checksum) has to be written to disk, which will not be understood by old versions.


Any comments/suggestions?


-suresh

Suresh Thalamati added a comment - 12/Feb/05 04:12 AM

The conclusion was to solve this problem by writing a checksum log record before writing the log buffer and verifying the checksum during recovery.

I don't know how to link a derby-dev list e-mail to JIRA, so I am just
copying and pasting the comments from the e-mail list.
Mike Matrigali wrote:


>>I think that some fix to this issue should be implemented for the next
>>release. The order of my preference is #2, #1, #3.
>>
>>


I believe option #2 (checksumming log records in the log buffers before
writing to the disk) is a good fix for this problem.
If there are no objections to this approach, I will start to work on
this.


-suresht



>>I think that option #2 can be implemented in the logging system and
>>requires very few, if any, changes to the rest of the system's processing
>>of log records. Log record offsets remain efficient, i.e. they can use
>>LSNs directly. Only the boot-time recovery code needs to look for the
>>new log record and do the work to verify checksums; online abort is
>>unaffected.
>>
>>I would like to see some performance numbers on the checksum overhead,
>>and if it is measurable then maybe some discussion of the checksum choice.
>>An obvious first choice would seem to be the standard Java-provided one
>>used on the data pages. If I had it to do over, I would probably have
>>used a different approach on the data pages. The point of the checksum
>>on the data page is not to catch data sector write errors (the system
>>expects the device to catch those); the only point is to catch
>>inconsistent sector writes (i.e. the 1st and 2nd 512-byte sectors written
>>but not the 3rd and 4th), and for this the current checksum is overkill.
>>For this, one need not checksum every byte on the page;
>>one can guarantee a consistent write with 1 bit per sector in the page.
>>
>>In the future we may want to revisit #3 if it looks like the stream log
>>is an I/O bottleneck which can't be addressed by striping or some other
>>hardware help like smart caching controllers. I see it as a performance
>>project rather than a correctness project. It also is a lot more work
>>and risk. Note that this could be a good project for someone wanting to
>>do some research in this area as it is implemented as a derby module
>>where an alternate implementation could be dropped in if available.
>>
>>While I believe that we should address this issue, I should also note
>>that in all my time working on Cloudscape/Derby I have never received a
>>problem database (in that time any log-related error would have come
>>through me) that resulted from this out-of-order/incomplete log
>>write issue. This of course does not mean it has not happened, just that
>>it was not reported to us and/or did not affect the database in a
>>noticeable way. We have actually never seen an out-of-order write on
>>the data pages either; we have seen a few checksum errors, but all of
>>those were caused by a bad disk.
>>
>>On the upgrade issue, it may be time to start an upgrade thread. Here
>>are just some thoughts. If doing option #2, it would be nice if the
>>new code could still read the old log files and then optionally
>>write the new log record or not. Then if users wanted to run a
>>release in a "soft" upgrade mode where they needed to be able to
>>go back to the old software they could - they just would not get
>>this fix. On a "hard" upgrade the software should continue to read
>>the old log files as they are currently formatted, and for any new
>>log files it should begin writing the new log record. Once the new
>>log record makes its way into the log file, accessing the db with the
>>old software is unsupported (it will throw an error as it won't know
>>what to do with the new log record).

Suresh Thalamati added a comment - 28/May/05 09:07 AM
The following changes fixed this problem:
r178494:
Small fix to make sure that log buffers are switched correctly when they are
full and the log checksum feature is disabled due to a soft upgrade.

r169737:
Some functional tests for the transaction log checksum feature. Log corruption is simulated using a proxy storage factory that allows a log write request to be corrupted before it is written to disk. CorruptDiskStorage factory by default forwards all requests to the underlying disk storage factory, except when corruption flags are enabled.

Recovery tests need to boot the same database many times and have to use a different subprotocol to enable the corruption instead of the default protocol. This seems to be possible only by adding a new test suite to the current test framework. Added a new suite called "storerecovery"; maybe all future recovery tests can be added to this suite.

r164994:
Changes to make soft upgrade work correctly with the transaction log checksum feature in 10.1. Added a checkVersion() method to the log factory itself, because that is where the version numbers are read from the log control file, but did not export the call to the rawstore factory as it is not needed now. (This can be done easily when there is a need for upgrade checks in the other store modules.)
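
The kind of gate described here might look roughly like the sketch below. It is not the actual checkVersion() from the commit: the class name, the two-argument signature, and the way the on-disk version is supplied are all assumptions made for illustration. The idea is only that checksum log records are written when the version recorded in the log control file is at least 10.1, so a soft-upgraded 10.0 database keeps the old log format.

```java
public class LogVersionGateSketch {
    // Version assumed to have been read from the log control file at boot.
    private final int onDiskMajor;
    private final int onDiskMinor;

    public LogVersionGateSketch(int onDiskMajor, int onDiskMinor) {
        this.onDiskMajor = onDiskMajor;
        this.onDiskMinor = onDiskMinor;
    }

    // True if the on-disk log format is at least the required version.
    public boolean checkVersion(int requiredMajor, int requiredMinor) {
        if (onDiskMajor != requiredMajor) {
            return onDiskMajor > requiredMajor;
        }
        return onDiskMinor >= requiredMinor;
    }

    // Checksum log records are only written once the database is at 10.1 or later.
    public boolean writesChecksumRecords() {
        return checkVersion(10, 1);
    }

    public static void main(String[] args) {
        System.out.println(new LogVersionGateSketch(10, 0).writesChecksumRecords()); // false
        System.out.println(new LogVersionGateSketch(10, 1).writesChecksumRecords()); // true
    }
}
```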

r159651:
This is a patch towards implementing checksum support for the transaction log to handle out-of-order incomplete log writes during recovery. This patch is based on writing a checksum log record that contains checksum information for a group of log records in the log buffers.

The changes in this patch address writing checksum information to the transaction log before the log records are written, and verifying the log at recovery time using the checksum information on disk.

Writing Checksum Log Records:
A checksum log record contains the checksum algorithm, the size of the data, and the checksum value.
Added a new class to implement this log operation.

The checksum log record is placed before the actual log data that it represents. This is done by reserving space in the log buffers and in the log file, then writing the checksum log record into the reserved buffer space whenever the buffer is full or needs to be written because of a flush request due to a commit. In the case of a large log record that does not fit into a single log buffer, the log record is written directly to the log file; in this case the checksum log record represents only one log record, and it is written to the log file before the large log record is written directly into the log file.
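
The reserve-then-back-fill step can be sketched as follows. This is an illustration under assumptions, not the class added by the patch: the 13-byte header layout (1-byte algorithm id, 4-byte data size, 8-byte value), the algorithm id constant, the use of java.util.zip.CRC32, and every name are hypothetical. Space for the checksum record is skipped when the buffer is opened, log records are appended behind it, and the header is filled in only at flush time, once the covered data is known.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

public class ReservedChecksumBufferSketch {
    // Hypothetical checksum record: byte algorithm | int dataSize | long value.
    public static final int RESERVED = 1 + 4 + 8;
    public static final byte ALGO_CRC32 = 1; // made-up algorithm id

    private final ByteBuffer buf;

    public ReservedChecksumBufferSketch(int capacity) {
        buf = ByteBuffer.allocate(capacity);
        buf.position(RESERVED); // reserve room for the checksum record up front
    }

    // Appends one serialized log record; false means it does not fit and the
    // caller has to flush first or handle it as a large record.
    public boolean append(byte[] record) {
        if (record.length > buf.remaining()) return false;
        buf.put(record);
        return true;
    }

    // Back-fills the reserved header, returns the bytes to write, and resets.
    public byte[] flush() {
        int dataLen = buf.position() - RESERVED;
        CRC32 crc = new CRC32();
        crc.update(buf.array(), RESERVED, dataLen);
        buf.put(0, ALGO_CRC32);
        buf.putInt(1, dataLen);
        buf.putLong(5, crc.getValue());
        byte[] out = Arrays.copyOf(buf.array(), RESERVED + dataLen);
        buf.clear();
        buf.position(RESERVED);
        return out;
    }

    public static void main(String[] args) {
        ReservedChecksumBufferSketch b = new ReservedChecksumBufferSketch(64);
        b.append(new byte[] {1, 2, 3});
        b.append(new byte[] {4, 5});
        System.out.println("bytes written on flush: " + b.flush().length);
    }
}
```

Reserving the space up front means the checksum record always lands immediately before the data it covers, without copying the buffered records at flush time.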

In the current system the log group information is encrypted when a database is encrypted. There is no facility to identify that a log record is a checksum log record without decrypting it. The checksum log record is therefore also encrypted, so that it works correctly with the rest of the system.