Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
The issue most likely manifests as the following critical error:
```
2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to write buffer.]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write buffer.
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444) [ignite-core-2.5.7.jar!/:2.5.7]
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249) [ignite-core-2.5.7.jar!/:2.5.7]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.5.7.jar!/:2.5.7]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
Caused by: java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[na:1.8.0_201]
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[na:1.8.0_201]
    at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48) ~[ignite-core-2.5.7.jar!/:2.5.7]
    at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41) ~[ignite-core-2.5.7.jar!/:2.5.7]
    at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111) ~[ignite-core-2.5.7.jar!/:2.5.7]
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437) [ignite-core-2.5.7.jar!/:2.5.7]
    ... 3 common frames omitted
```
It appears that the following sequence is possible:
- Thread A attempts to log a large record that does not fit into the current segment; addRecord fails and thread A starts segment rollover. It successfully runs flushOrWait(null) and gets de-scheduled before adding the switch-segment record
- Thread B attempts to log another record, which fits exactly up to the end of the current segment. The record is added to the buffer
- Thread A resumes and fails to add the switch-segment record. No flush is performed, and the thread immediately proceeds to close the WAL writer
- The WAL writer thread wakes up, sees that there is a CLOSE request, closes the file IO, and only then attempts to write the unflushed changes, causing the exception.
An unconditional flush after writing the switch-segment record should fix the issue.
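The proposed fix can be illustrated with a minimal, self-contained sketch. All names here (WalHandleSketch, closeForRollover, the string-based buffer) are hypothetical simplifications, not the actual FileWriteAheadLogManager API; the point is only the ordering: append the switch-segment record, then flush unconditionally, then mark the handle closed, so no bytes remain buffered when the writer honours the CLOSE request.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch of the proposed rollover fix: flush unconditionally
 * after the switch-segment record so nothing stays buffered at close time.
 */
class WalHandleSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final StringBuilder buf = new StringBuilder();   // stand-in for the WAL buffer
    private final StringBuilder file = new StringBuilder();  // stand-in for the segment file
    private boolean closed;

    /** Appends a record to the in-memory buffer; fails once the handle is closed. */
    boolean addRecord(String rec) {
        lock.lock();
        try {
            if (closed)
                return false;
            buf.append(rec);
            return true;
        }
        finally { lock.unlock(); }
    }

    /** Moves buffered bytes to the "file"; a stand-in for flushOrWait(null). */
    void flush() {
        lock.lock();
        try {
            file.append(buf);
            buf.setLength(0);
        }
        finally { lock.unlock(); }
    }

    /** Rollover with the fix applied: append switch record, then flush unconditionally. */
    void closeForRollover() {
        addRecord("[SWITCH_SEGMENT]");
        flush(); // unconditional flush: nothing is left behind for the writer thread
        lock.lock();
        try { closed = true; }
        finally { lock.unlock(); }
    }

    int unflushedLen() { return buf.length(); }

    String fileContents() { return file.toString(); }
}
```

With this ordering, even if another thread managed to append a record between the rollover steps, the unconditional flush drains the buffer before the handle is marked closed.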
Besides the bug itself, I suggest the following changes to FileWriteHandleImpl (FileWriteAheadLogManager in earlier versions):
- There is an fsync(filePtr) call inside close(); however, fsync() checks the stop flag (which is set inside close()) and returns immediately after flushOrWait() when the flag is set, which is very confusing. After all, close() itself explicitly calls force() after the flush
- In mmap mode there is an ignored IO exception; it should be propagated to the failure handler
- In the WAL writer, we check for a file CLOSE request and then attempt to write to (possibly) the same write handle; the write should always happen before the close
- In the WAL writer, there are racy reads of the current handle; it would be better to read the current handle once and operate on that reference for the whole loop iteration
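The last two suggestions can be sketched together. This is a hypothetical simplification, not the real WAL writer (the types Handle, Request, and iteration() are invented for illustration): each loop iteration takes a single snapshot of the current handle, drains pending writes first, and only then honours a CLOSE request, so the write can never land on an already-closed channel.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch of the suggested WAL-writer loop shape: one read of
 * the current handle per iteration, and write strictly before close.
 */
class WalWriterLoopSketch {
    enum Request { NONE, CLOSE }

    /** Simplified stand-in for the file write handle. */
    static final class Handle {
        volatile Request req = Request.NONE;
        final StringBuilder pending = new StringBuilder(); // unflushed bytes
        final StringBuilder file = new StringBuilder();    // "on-disk" bytes
        volatile boolean open = true;

        void drainPending() {
            file.append(pending);
            pending.setLength(0);
        }
    }

    final AtomicReference<Handle> current = new AtomicReference<>(new Handle());

    /** One loop iteration, operating on a single snapshot of the handle. */
    void iteration() {
        Handle hnd = current.get(); // read the handle once per iteration

        hnd.drainPending();         // write FIRST...

        if (hnd.req == Request.CLOSE)
            hnd.open = false;       // ...then honour the CLOSE request
    }
}
```

Because the iteration holds one reference throughout, a concurrent handle switch cannot cause it to flush one handle and close another; and because the drain precedes the close check, a CLOSE request can never strand unflushed bytes behind a closed channel.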
Issue Links
- blocks
  - IGNITE-12131 MMAP mode should be disabled by default for WAL manager (In Progress)