Flink / FLINK-9940

File source continuous monitoring mode: S3 files sometimes missed

Details

    Description

      When StreamExecutionEnvironment.readFile() is used with FileProcessingMode.PROCESS_CONTINUOUSLY to monitor an S3 prefix, the directory monitor can miss files if a large number of files are added or modified at the same time. The number of missed files depends on the monitoring interval.

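      Cause: Flink tracks which files it has already read by remembering only the modification time of the file that was added (or modified) last. When several files share the same last-modified timestamp and only some of them are visible during a scan, the remaining ones are not newer than the remembered timestamp and are skipped on later scans.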
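
The cause can be reproduced in isolation. The following is a minimal sketch in plain Java, not Flink's actual code: `MaxTimestampTracker` and `MissedFileDemo` are hypothetical names, and a directory listing is modeled as a map from file name to modification time. It assumes the monitor accepts only files strictly newer than the single remembered maximum.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the monitor's bookkeeping: one global "max modification time". */
class MaxTimestampTracker {
    private long maxModTime = Long.MIN_VALUE;

    /** Returns the files from this listing that are treated as new. */
    List<String> scan(Map<String, Long> listing) {
        List<String> accepted = new ArrayList<>();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            // Anything not strictly newer than the remembered maximum is ignored.
            if (e.getValue() > maxModTime) {
                accepted.add(e.getKey());
                newMax = Math.max(newMax, e.getValue());
            }
        }
        maxModTime = newMax;
        return accepted;
    }
}

class MissedFileDemo {
    /** Runs two scans and returns every file the tracker accepted. */
    static List<String> run() {
        MaxTimestampTracker tracker = new MaxTimestampTracker();

        Map<String, Long> firstListing = new LinkedHashMap<>();
        firstListing.put("a.csv", 1000L);
        List<String> seen = new ArrayList<>(tracker.scan(firstListing));

        // b.csv carries the same timestamp but only shows up in the next listing.
        Map<String, Long> secondListing = new LinkedHashMap<>(firstListing);
        secondListing.put("b.csv", 1000L);
        seen.addAll(tracker.scan(secondListing));

        return seen;
    }
}
```

In this sketch `MissedFileDemo.run()` never returns `b.csv`: its timestamp equals the remembered maximum, so the second scan ignores it.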

      Suggested solution (thanks to Fabian Hueske): a hybrid approach that keeps the names of all files whose modification timestamp is later than the maximum modification time minus an offset. The relevant class is org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.
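
One possible reading of the hybrid approach, sketched in plain Java under stated assumptions. `HybridFileTracker` and `offsetMillis` are hypothetical names, not part of Flink, and this is not the patch that was applied to ContinuousFileMonitoringFunction. The sketch assumes that files older than the maximum modification time minus the offset have already been handled, and that inside that window a file is new if its name is unknown or its timestamp has increased.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker: max modification time plus the names of recently seen files. */
class HybridFileTracker {
    private final long offsetMillis;
    private long maxModTime = Long.MIN_VALUE;
    // Names (with timestamps) of files newer than maxModTime - offsetMillis.
    private final Map<String, Long> recentFiles = new HashMap<>();

    HybridFileTracker(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    /** Returns the files from this listing that are treated as new or modified. */
    List<String> scan(Map<String, Long> listing) {
        // Before the first file is seen there is no window yet, so nothing is filtered by age.
        long threshold = (maxModTime == Long.MIN_VALUE) ? Long.MIN_VALUE : maxModTime - offsetMillis;
        List<String> accepted = new ArrayList<>();
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            String name = e.getKey();
            long modTime = e.getValue();
            if (modTime <= threshold) {
                continue; // outside the window: assumed to be processed already
            }
            Long known = recentFiles.get(name);
            if (known == null || known < modTime) {
                // Unknown name, or known name with a newer timestamp.
                accepted.add(name);
                recentFiles.put(name, modTime);
                maxModTime = Math.max(maxModTime, modTime);
            }
        }
        if (maxModTime != Long.MIN_VALUE) {
            // Forget names that have fallen out of the window to bound memory use.
            long newThreshold = maxModTime - offsetMillis;
            recentFiles.values().removeIf(t -> t <= newThreshold);
        }
        return accepted;
    }
}
```

The offset trades memory for tolerance: a larger value keeps more file names in state but accepts files that become visible later with an older timestamp. Files that appear with a timestamp older than the window are still missed.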

            People

              Averell Huyen Levan
              Votes: 0
              Watchers: 7