Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 1.5.1
- None
- Environment: Flink 1.5, EMRFS
Description
When using StreamExecutionEnvironment.readFile() in FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, the directory monitoring process might miss some files if a large number of files are added or modified at the same time. The number of missed files depends on the monitoring interval.
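Cause: Flink tracks which files it has read by remembering only the modification time of the file that was added (or modified) last. So when multiple files have the same last-modified timestamp, any of them that first shows up in a listing after that timestamp has already been recorded is treated as already read and is skipped.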
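To make the cause concrete, here is a minimal, self-contained simulation of tracking only the maximum modification time. It is a sketch and not Flink's actual code: the class MaxTimestampMonitor, its scan method, and the s3://bucket/prefix/... paths are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a monitor that remembers only the newest timestamp.
public class MaxTimestampMonitor {
    // Largest modification time seen so far; the only state kept between scans.
    private long maxModTime = Long.MIN_VALUE;

    // Takes the current listing (path -> modification time) and returns
    // the paths this scan would forward for reading.
    public List<String> scan(Map<String, Long> listing) {
        List<String> fresh = new ArrayList<>();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            // Only files strictly newer than the remembered maximum count as new.
            if (e.getValue() > maxModTime) {
                fresh.add(e.getKey());
                newMax = Math.max(newMax, e.getValue());
            }
        }
        maxModTime = newMax;
        return fresh;
    }

    public static void main(String[] args) {
        MaxTimestampMonitor monitor = new MaxTimestampMonitor();
        Map<String, Long> listing = new LinkedHashMap<>();

        // First scan sees file "a" with timestamp 1000.
        listing.put("s3://bucket/prefix/a", 1000L);
        System.out.println(monitor.scan(listing)); // [s3://bucket/prefix/a]

        // File "b" has the same timestamp but becomes visible only now.
        listing.put("s3://bucket/prefix/b", 1000L);
        System.out.println(monitor.scan(listing)); // [] : "b" is never read
    }
}
```

With a longer monitoring interval, more files with an identical timestamp tend to land in the same scan, which is consistent with the number of missed files depending on the interval.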
Suggested solution (thanks to Fabian Hueske): a hybrid approach that keeps the names of all files whose modification timestamp is larger than the maximum modification time minus an offset. Affected class: org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction
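One possible reading of the hybrid approach is sketched below. It is not the fix that was merged: the class HybridMonitor, the offsetMillis parameter, and the example paths are assumptions made for illustration. The idea is to keep the maximum modification time and, in addition, the names of files whose timestamp falls inside the window (max - offset, max], so that a file inside the window is skipped only if its name has already been seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: max timestamp plus a set of recently seen file names.
public class HybridMonitor {
    private final long offsetMillis;          // assumed tuning knob
    private long maxModTime = Long.MIN_VALUE; // newest timestamp seen so far
    // Names (and timestamps) of files newer than maxModTime - offsetMillis.
    private final Map<String, Long> recent = new HashMap<>();

    public HybridMonitor(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    // Files at or below this timestamp are assumed to have been processed.
    private long cutoff() {
        return maxModTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxModTime - offsetMillis;
    }

    public List<String> scan(Map<String, Long> listing) {
        List<String> fresh = new ArrayList<>();
        long cutoff = cutoff();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            long modTime = e.getValue();
            if (modTime <= cutoff) {
                continue; // older than the window: treated as already read
            }
            // Inside the window: decide by name, not by timestamp alone.
            Long previous = recent.put(e.getKey(), modTime);
            if (previous == null || previous < modTime) {
                fresh.add(e.getKey());
            }
            newMax = Math.max(newMax, modTime);
        }
        maxModTime = newMax;
        // Drop names that have fallen out of the window to bound the state size.
        long newCutoff = cutoff();
        recent.values().removeIf(t -> t <= newCutoff);
        return fresh;
    }

    public static void main(String[] args) {
        HybridMonitor monitor = new HybridMonitor(100L);
        Map<String, Long> listing = new LinkedHashMap<>();

        listing.put("s3://bucket/prefix/a", 1000L);
        System.out.println(monitor.scan(listing)); // [s3://bucket/prefix/a]

        // Same timestamp, visible one scan later: now picked up by name.
        listing.put("s3://bucket/prefix/b", 1000L);
        System.out.println(monitor.scan(listing)); // [s3://bucket/prefix/b]
    }
}
```

The offset is a trade-off in this sketch: a larger value tolerates files that appear later relative to the newest timestamp, at the cost of remembering more file names. A file whose timestamp lies more than the offset behind the maximum would still be skipped.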
Issue Links
- is duplicated by: FLINK-8046 ContinuousFileMonitoringFunction wrongly ignores files with exact same timestamp (Closed)