Flink / FLINK-12022

Enable StreamWriter to update file length on sync flush

Details

    Description

      Currently, users of file systems that do not support truncate have to rely on BucketingSink's valid-length file to mark the checkpointed data position. The problem is that, by default, the file length is only updated when a block is full or the file is closed. If the job crashes and the file is not closed properly, the reported file length lags behind both the actual length and the checkpointed valid length. When the job restarts, this looks like data loss, because the valid length is larger than the reported file length. The situation persists until the namenode notices the change in the file's block size, which can take half an hour or more.

      I therefore propose adding an option to StreamWriterBase that updates the file length on each flush. This can be expensive because it involves the namenode, so it should only be enabled when strong consistency is needed.
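
The mismatch described above, and the effect of the proposed option, can be illustrated with a small toy model. This is only a sketch under stated assumptions: `SimulatedStream`, `updateLengthOnFlush`, and `looksLikeDataLoss` are hypothetical names invented for illustration and are not part of Flink or Hadoop. On real HDFS, the corresponding call is most likely `HdfsDataOutputStream.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH))`, but how it would be wired into StreamWriterBase is not specified in this issue.

```java
/** Toy model of the length the namenode reports vs. the bytes actually written. */
public class LengthUpdateSketch {

    /** Hypothetical stand-in for an HDFS output stream; not a Hadoop or Flink class. */
    static final class SimulatedStream {
        private final long blockSize;
        private final boolean updateLengthOnFlush; // the proposed option (assumed name)
        private long writtenLength;   // bytes actually flushed to the datanodes
        private long reportedLength;  // length the namenode would report to readers

        SimulatedStream(long blockSize, boolean updateLengthOnFlush) {
            this.blockSize = blockSize;
            this.updateLengthOnFlush = updateLengthOnFlush;
        }

        void write(long bytes) {
            writtenLength += bytes;
            // By default the namenode only learns about completed blocks.
            long completedBlocks = (writtenLength / blockSize) * blockSize;
            reportedLength = Math.max(reportedLength, completedBlocks);
        }

        /** Flushes and returns the position a checkpoint would record as the valid length. */
        long flush() {
            if (updateLengthOnFlush) {
                // Models the extra namenode round trip that makes this option expensive.
                reportedLength = writtenLength;
            }
            return writtenLength;
        }

        long reportedLength() {
            return reportedLength;
        }
    }

    /** After a crash, recovery looks like data loss if the valid length exceeds the reported length. */
    static boolean looksLikeDataLoss(long validLength, long reportedLength) {
        return validLength > reportedLength;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        for (boolean update : new boolean[] {false, true}) {
            SimulatedStream stream = new SimulatedStream(blockSize, update);
            stream.write(1000);
            long validLength = stream.flush(); // checkpoint taken here, then the job crashes
            System.out.println("updateLengthOnFlush=" + update
                    + " validLength=" + validLength
                    + " reportedLength=" + stream.reportedLength()
                    + " looksLikeDataLoss="
                    + looksLikeDataLoss(validLength, stream.reportedLength()));
        }
    }
}
```

In this model the crash happens right after the flush, without a close. Only the run with the option enabled leaves the reported length equal to the checkpointed valid length, at the cost of one simulated namenode update per flush.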

            People

              Assignee: Unassigned
              Reporter: Paul Lin
              Votes: 0
              Watchers: 1
