FLINK-10518

Inefficient design in ContinuousFileMonitoringFunction

    Details

      Description

      The ContinuousFileMonitoringFunction class tracks the latest file modification time so that it can skip every file it has already processed in previous cycles. For a long-running job, the list of eligible files is therefore usually much smaller than the list of all files in the monitored folder.
      In the current implementation of the getInputSplitsSortedByModTime method, a (potentially large) list of all available splits is created first, and then every single split is checked against the eligible files:

      for (FileInputSplit split : format.createInputSplits(readerParallelism)) {
          FileStatus fileStatus = eligibleFiles.get(split.getPath());
          if (fileStatus != null) {
              // ... (rest of the loop body omitted)
          }
      }

      This could be improved as follows:

      • Files should be listed only once, in ContinuousFileMonitoringFunction.listEligibleFiles(). Currently they are listed a second time in FileInputFormat.createInputSplits().
      • The file splits should then be created only from the paths in eligibleFiles.
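
      The two steps above can be sketched in plain Java. This is only an illustration of the proposed flow, not Flink's code: FileEntry, Split, listEligibleFiles and splitsSortedByModTime below are hypothetical stand-ins, and the fixed splitSize is an assumption (real split computation in FileInputFormat depends on block sizes and the requested parallelism).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EligibleSplitsSketch {

    /** Hypothetical stand-in for one directory-listing entry. */
    static final class FileEntry {
        final String path;
        final long length;
        final long modTime;

        FileEntry(String path, long length, long modTime) {
            this.path = path;
            this.length = length;
            this.modTime = modTime;
        }
    }

    /** Hypothetical stand-in for a file split: a byte range of one file. */
    static final class Split {
        final String path;
        final long start;
        final long length;
        final long modTime;

        Split(String path, long start, long length, long modTime) {
            this.path = path;
            this.start = start;
            this.length = length;
            this.modTime = modTime;
        }

        @Override
        public String toString() {
            return path + "[" + start + ", +" + length + "] modTime=" + modTime;
        }
    }

    /** Single listing pass: keep only files modified after the last processed timestamp. */
    static Map<String, FileEntry> listEligibleFiles(List<FileEntry> listing, long maxProcessedModTime) {
        Map<String, FileEntry> eligible = new LinkedHashMap<>();
        for (FileEntry f : listing) {
            if (f.modTime > maxProcessedModTime) {
                eligible.put(f.path, f);
            }
        }
        return eligible;
    }

    /** Build splits from the eligible files only (no second listing), then sort by modification time. */
    static List<Split> splitsSortedByModTime(Map<String, FileEntry> eligible, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (FileEntry f : eligible.values()) {
            // Assumed fixed-size chunking; an empty file yields no splits in this sketch.
            for (long start = 0; start < f.length; start += splitSize) {
                long len = Math.min(splitSize, f.length - start);
                splits.add(new Split(f.path, start, len, f.modTime));
            }
        }
        // List.sort is stable, so splits of the same file stay in offset order.
        splits.sort(Comparator.comparingLong((Split s) -> s.modTime));
        return splits;
    }

    public static void main(String[] args) {
        List<FileEntry> listing = new ArrayList<>();
        listing.add(new FileEntry("/data/old.csv", 100, 10));
        listing.add(new FileEntry("/data/new-b.csv", 250, 30));
        listing.add(new FileEntry("/data/new-a.csv", 80, 20));

        Map<String, FileEntry> eligible = listEligibleFiles(listing, 10);
        for (Split s : splitsSortedByModTime(eligible, 100)) {
            System.out.println(s);
        }
    }
}
```

      With this shape, the cost of creating splits scales with the number of new files per cycle rather than with the total number of files in the monitored folder.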

            People

            • Assignee: guibopan Guibo Pan
            • Reporter: Averell Huyen Levan
