Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- None
Description
A data corruption issue was recently observed in one of the clusters, where container replicas were found to be corrupted. The issue was primarily caused by a race condition between the readStateMachine and writeStateMachine threads, which were reading and writing the chunks concurrently. The following logs confirm this:
INFO ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56510
2021-08-11 2028,524 [ChunkWriter-1-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56513
2021-08-11 2028,524 [ChunkWriter-1-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56507
2021-08-11 2028,542 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,543 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,544 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
2021-08-11 2028,545 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56513
2021-08-11 2028,549 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,550 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,551 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
2021-08-11 2028,553 [ChunkWriter-0-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56513
2021-08-11 2028,648 [ChunkWriter-3-0] INFO ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56507
The assumption until now was that readStateMachine and writeStateMachine operations are executed serially on a single-thread executor, chosen by applying a hash function to the BlockID. This does not seem to work well in practice.
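For readers unfamiliar with the pattern, the following is a minimal sketch of hash-based executor selection, not the actual ContainerStateMachine code. The class name StripedChunkExecutors and its methods are hypothetical. It assumes a pool of single-thread executors and a numeric block ID; tasks submitted with the same ID land on the same executor and therefore run one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical illustration only; names are not taken from Ozone or Ratis.
public class StripedChunkExecutors {
    private final List<ExecutorService> executors = new ArrayList<>();

    public StripedChunkExecutors(int stripes) {
        for (int i = 0; i < stripes; i++) {
            // Each stripe has exactly one thread, so its tasks run in FIFO order.
            executors.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same block ID always maps to the same stripe index.
    public int indexFor(long blockId) {
        return Math.floorMod(Long.hashCode(blockId), executors.size());
    }

    // Runs the task on the stripe that owns this block ID.
    public <T> CompletableFuture<T> submit(long blockId, Supplier<T> task) {
        return CompletableFuture.supplyAsync(task, executors.get(indexFor(blockId)));
    }

    public void shutdown() {
        executors.forEach(ExecutorService::shutdown);
    }
}
```

The serial guarantee holds only when every read and every write for a block goes through submit with the same key. Any code path that touches the chunk file from another thread, or that derives the key differently, is outside that guarantee.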
When a file channel is written and read by concurrent threads, the result can be sparse files, reads that return all zeros, and similar anomalies. The outcome becomes unpredictable and leads to corrupt data.
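One generic way to rule out this kind of interleaving is to guard each chunk file with its own lock. The sketch below illustrates that idea under stated assumptions; it is not the fix that was applied for this issue. The class GuardedChunkFile and its methods are hypothetical, and it assumes all readers and writers run in the same JVM.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not the Ozone chunk I/O code.
public class GuardedChunkFile {
    // One read/write lock per chunk file, keyed by normalized absolute path.
    private static final ConcurrentHashMap<Path, ReentrantReadWriteLock> LOCKS =
        new ConcurrentHashMap<>();

    private static ReentrantReadWriteLock lockFor(Path file) {
        return LOCKS.computeIfAbsent(file.toAbsolutePath().normalize(),
            p -> new ReentrantReadWriteLock());
    }

    // Writes data at the given offset while holding the exclusive lock.
    public static void write(Path file, long offset, byte[] data) throws IOException {
        ReentrantReadWriteLock lock = lockFor(file);
        lock.writeLock().lock();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            long pos = offset;
            while (buf.hasRemaining()) {
                pos += ch.write(buf, pos);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reads up to length bytes from offset; blocked while a writer holds the lock.
    public static byte[] read(Path file, long offset, int length) throws IOException {
        ReentrantReadWriteLock lock = lockFor(file);
        lock.readLock().lock();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    break; // end of file reached before length bytes were read
                }
                pos += n;
            }
            return Arrays.copyOf(buf.array(), buf.position());
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this arrangement a reader can never observe a partially written chunk, because the write lock excludes all readers until the write loop has finished. The lock map in this sketch grows without bound; a real implementation would need to evict entries for closed or deleted chunk files.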