  1. Apache Ozone
  2. HDDS-5619

Ozone data corruption issue on Datanodes

Details

    Description

      A data corruption issue was recently observed in one of the clusters, where container replicas were found to be corrupted. The issue was primarily caused by a race condition between the readStateMachine and writeStateMachine threads, which were reading and writing the same chunks concurrently. The following logs confirm this:

      INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56510
      2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56513
      2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56507
      2021-08-11 2028,542 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
      2021-08-11 2028,543 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
      2021-08-11 2028,544 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
      2021-08-11 2028,545 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56513
      2021-08-11 2028,549 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
      2021-08-11 2028,550 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
      2021-08-11 2028,551 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
      2021-08-11 2028,553 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56513
      2021-08-11 2028,648 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56507
      

      The assumption so far was that readStateMachine and writeStateMachine calls are executed serially on a single-thread executor, chosen by a hash function on the BlockID. This does not seem to work well in practice.
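
      The assumed dispatch scheme can be sketched roughly as follows. This is a minimal illustration, not the actual ContainerStateMachine code: the class BlockKeyedExecutors, its methods, and the choice of four executors are all hypothetical, and the block ID is taken from the chunk file name in the logs purely as a sample value. The point it shows is that serialization holds only when every read and write for a block goes through the same keyed executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of hash-based dispatch to single-thread executors.
class BlockKeyedExecutors {
  private final ExecutorService[] executors;

  BlockKeyedExecutors(int numExecutors) {
    executors = new ExecutorService[numExecutors];
    for (int i = 0; i < numExecutors; i++) {
      executors[i] = Executors.newSingleThreadExecutor();
    }
  }

  // Map a block ID to an executor index; floorMod keeps the index non-negative.
  int indexFor(long blockId) {
    return Math.floorMod(Long.hashCode(blockId), executors.length);
  }

  // Operations on the same block ID always land on the same executor,
  // so they run one at a time, in submission order.
  Future<?> submit(long blockId, Runnable op) {
    return executors[indexFor(blockId)].submit(op);
  }

  void shutdown() {
    for (ExecutorService e : executors) {
      e.shutdown();
    }
  }

  // Submits a write, a read, and another write for one block and
  // returns the order in which they actually ran.
  static List<String> demo() {
    BlockKeyedExecutors pool = new BlockKeyedExecutors(4);
    List<String> order = new CopyOnWriteArrayList<>();
    long blockId = 107544261427200162L; // sample value only
    List<Future<?>> futures = new ArrayList<>();
    futures.add(pool.submit(blockId, () -> order.add("write chunk_2")));
    futures.add(pool.submit(blockId, () -> order.add("read chunk_2")));
    futures.add(pool.submit(blockId, () -> order.add("write chunk_3")));
    try {
      for (Future<?> f : futures) {
        f.get(); // wait for each task to finish
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
    return order;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

      If any code path touches the chunk file without going through submit with the same block ID (for example, a read issued directly from another thread), the guarantee no longer applies and the operations can interleave.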

      When a file channel is written and read by concurrent threads, the datanode can end up writing sparse files, reading all zeros, and so on. The end result becomes unpredictable and causes corrupt data.
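
      One such interleaving can be simulated deterministically on a single thread. The sketch below is an assumption-laden illustration, not code from Ozone: the class PartialChunkRead and its method are hypothetical. It writes only part of a chunk and then issues a positional read for the full chunk length, as a reader racing ahead of the writer would. The read returns fewer bytes than requested, and the unread tail of the buffer stays zero-filled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical single-threaded simulation of a read that runs before
// the corresponding chunk write has completed.
class PartialChunkRead {

  // Writes writtenSoFar bytes of a chunkLen-byte chunk, then tries to read
  // the whole chunk. Returns the number of bytes the reader actually got.
  static int readBeforeWriteCompletes(int chunkLen, int writtenSoFar) {
    try {
      Path file = Files.createTempFile("chunk", ".tmp");
      try (FileChannel ch = FileChannel.open(
          file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        // Writer: only part of the chunk has reached the file so far.
        byte[] data = new byte[writtenSoFar];
        Arrays.fill(data, (byte) 1);
        ByteBuffer src = ByteBuffer.wrap(data);
        long pos = 0;
        while (src.hasRemaining()) {
          pos += ch.write(src, pos);
        }

        // Reader: asks for the full chunk length from offset 0.
        ByteBuffer dst = ByteBuffer.allocate(chunkLen);
        int total = 0;
        while (dst.hasRemaining()) {
          int n = ch.read(dst, total);
          if (n <= 0) {
            break; // end of file reached before the buffer was filled
          }
          total += n;
        }
        // Bytes in dst beyond 'total' were never written and remain zero.
        return total;
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(readBeforeWriteCompletes(8, 4));
  }
}
```

      A caller that does not check the returned byte count against the expected chunk length would treat the zero-filled tail as real data.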

      Attachments

        1. repro.patch
          16 kB
          Shashikant Banerjee

            People

              shashikant Shashikant Banerjee
              avijayan Aravindan Vijayan