Flink / FLINK-10518

Inefficient design in ContinuousFileMonitoringFunction


Details

    Description

      The ContinuousFileMonitoringFunction class tracks the latest file modification time it has seen so that it can skip every file it processed in previous cycles. For a long-running job, the list of eligible files will therefore be much smaller than the list of all files in the monitored folder.
      In the current implementation of the getInputSplitsSortedByModTime method, a (potentially large) list of all available splits is created first, and then every single split is looked up in eligibleFiles:

      for (FileInputSplit split : format.createInputSplits(readerParallelism)) {
          FileStatus fileStatus = eligibleFiles.get(split.getPath());
          if (fileStatus != null) {

      This could be improved as follows:

      • List all files only once, in ContinuousFileMonitoringFunction.listEligibleFiles() (currently the listing is repeated in FileInputFormat.createInputSplits()).
      • Create the list of file splits directly from the paths in eligibleFiles.
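
      The difference between the two approaches can be sketched in plain Java. This is only an illustration under simplifying assumptions, not Flink code: FileInfo and Split are hypothetical stand-ins for FileStatus and FileInputSplit, each file yields exactly one split, and the directory is modelled as an in-memory list. All class and method names below are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EligibleSplitSketch {

    /** Hypothetical stand-in for a file status: a path plus its modification time. */
    public static final class FileInfo {
        public final String path;
        public final long modTime;
        public FileInfo(String path, long modTime) {
            this.path = path;
            this.modTime = modTime;
        }
    }

    /** Hypothetical stand-in for an input split (one split per file, for simplicity). */
    public static final class Split {
        public final String path;
        public final long modTime;
        public Split(String path, long modTime) {
            this.path = path;
            this.modTime = modTime;
        }
    }

    /** Single pass over the directory: keep files modified after the last processed time. */
    public static Map<String, FileInfo> listEligibleFiles(List<FileInfo> directory, long lastProcessedTime) {
        Map<String, FileInfo> eligible = new LinkedHashMap<>();
        for (FileInfo file : directory) {
            if (file.modTime > lastProcessedTime) {
                eligible.put(file.path, file);
            }
        }
        return eligible;
    }

    /** Pattern described in the issue: build a split for every file, then filter. */
    public static List<Split> splitsViaFullListing(List<FileInfo> directory, Map<String, FileInfo> eligible) {
        List<Split> allSplits = new ArrayList<>();
        for (FileInfo file : directory) { // second pass over the whole directory
            allSplits.add(new Split(file.path, file.modTime));
        }
        List<Split> result = new ArrayList<>();
        for (Split split : allSplits) {
            if (eligible.get(split.path) != null) {
                result.add(split);
            }
        }
        return result;
    }

    /** Proposed pattern: build splits only from the eligible files. */
    public static List<Split> splitsFromEligible(Map<String, FileInfo> eligible) {
        List<Split> result = new ArrayList<>();
        for (FileInfo file : eligible.values()) {
            result.add(new Split(file.path, file.modTime));
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileInfo> directory = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            directory.add(new FileInfo("file-" + i, i));
        }
        // Only files modified after time 997 are still unprocessed.
        Map<String, FileInfo> eligible = listEligibleFiles(directory, 997L);
        System.out.println("full listing: " + splitsViaFullListing(directory, eligible).size() + " splits");
        System.out.println("eligible only: " + splitsFromEligible(eligible).size() + " splits");
    }
}
```

      Both methods return the same splits, but the second one does work proportional to the number of eligible files rather than to the size of the whole folder. A real change would also have to deal with files that produce several splits, which this sketch ignores.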

          People

            Assignee: Unassigned
            Reporter: Averell Huyen Levan
            Votes: 0
            Watchers: 2
