Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-15312

FileRawSnapshotWriter must flush before atomic move

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 3.3.3, 3.6.0, 3.4.2, 3.5.2
    • kraft
    • None

    Description

      On ext4 file systems it is possible for KRaft to create zero-length snapshot files. Not all file system fsync to disk on close. For KRaft to guarantee that the data has made it to disk before calling rename, it needs to make sure that the file has been fsync.

      We have seen cases were the snapshot file has zero-length data on ext4 file system.

       "Delayed allocation" means that the filesystem tries to delay the allocation of physical disk blocks for written data for as long as possible. This policy brings some important performance benefits. Many files are short-lived; delayed allocation can keep the system from writing fleeting temporary files to disk at all. And, for longer-lived files, delayed allocation allows the kernel to accumulate more data and to allocate the blocks for data contiguously, speeding up both the write and any subsequent reads of that data. It's an important optimization which is found in most contemporary filesystems.

      But, if blocks have not been allocated for a file, there is no need to write them quickly as a security measure. Since the blocks do not yet exist, it is not possible to read somebody else's data from them. So ext4 will not (cannot) write out unallocated blocks as part of the next journal commit cycle. Those blocks will, instead, wait until the kernel decides to flush them out; at that point, physical blocks will be allocated on disk and the data will be made persistent. The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3. And that is why a crash can cause the loss of quite a bit more data when ext4 is being used. 

      from: https://lwn.net/Articles/322823/

      auto_da_alloc ( * ), noauto_da_alloc

      Many broken applications don't use fsync() when replacing existing files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.

      from: https://www.kernel.org/doc/html/latest/admin-guide/ext4.html

       

      Attachments

        Issue Links

          Activity

            People

              jsancio José Armando García Sancio
              jsancio José Armando García Sancio
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: