Details
Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.5.2
Fix Version/s: None
Description
The ContinuousFileMonitoringFunction class keeps track of the latest file modification time to rule out all files it has processed in the previous cycles. For a long-running job, the list of eligible files will be much smaller than the list of all files in the folder being monitored.
In the current implementation of the getInputSplitsSortedByModTime method, a (potentially large) list of all available splits is created first, and then every single split is checked against the map of eligible files:
for (FileInputSplit split : format.createInputSplits(readerParallelism)) {
    FileStatus fileStatus = eligibleFiles.get(split.getPath());
    if (fileStatus != null) {
        // only splits that belong to an eligible file are kept
    }
}
The improvement could be made as follows:
- The listing of all files should be done only once, in ContinuousFileMonitoringFunction.listEligibleFiles() (currently it is done a second time in FileInputFormat.createInputSplits()).
- The list of file splits should then be created directly from the paths in eligibleFiles.
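
The two steps above can be illustrated with a minimal, self-contained sketch. It does not use the Flink API: FileInfo, Split, EligibleSplitsSketch and both method signatures are hypothetical stand-ins chosen for illustration, and the fixed-size splitting rule is an assumption rather than what FileInputFormat actually does. The point is only that the directory is listed once and that splits are built for eligible files alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink code: all names here are illustrative.
class EligibleSplitsSketch {

    /** Stand-in for a file status: path, length in bytes, modification time. */
    static final class FileInfo {
        final String path;
        final long length;
        final long modTime;

        FileInfo(String path, long length, long modTime) {
            this.path = path;
            this.length = length;
            this.modTime = modTime;
        }
    }

    /** Stand-in for a file split: a byte range of one file. */
    static final class Split {
        final String path;
        final long start;
        final long length;
        final long modTime;

        Split(String path, long start, long length, long modTime) {
            this.path = path;
            this.start = start;
            this.length = length;
            this.modTime = modTime;
        }
    }

    /** Step 1: a single pass over the listing keeps only files newer than the last seen timestamp. */
    static Map<String, FileInfo> listEligibleFiles(List<FileInfo> allFiles, long globalModTime) {
        Map<String, FileInfo> eligible = new LinkedHashMap<>();
        for (FileInfo file : allFiles) {
            if (file.modTime > globalModTime) {
                eligible.put(file.path, file);
            }
        }
        return eligible;
    }

    /** Step 2: splits are created from the eligible files only, then sorted by modification time. */
    static List<Split> createSplitsSortedByModTime(Map<String, FileInfo> eligible, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (FileInfo file : eligible.values()) {
            if (file.length == 0) {
                // assumption: an empty file still yields one empty split
                splits.add(new Split(file.path, 0, 0, file.modTime));
                continue;
            }
            for (long start = 0; start < file.length; start += splitSize) {
                long len = Math.min(splitSize, file.length - start);
                splits.add(new Split(file.path, start, len, file.modTime));
            }
        }
        // List.sort is stable, so splits of the same file keep their offset order
        splits.sort(Comparator.comparingLong((Split s) -> s.modTime));
        return splits;
    }
}
```

With this shape, the cost of split creation depends on the number of eligible files rather than on the total number of files in the monitored folder, which is the saving the ticket is after for long-running jobs.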