Spark / SPARK-17159

Improve FileInputDStream.findNewFiles list performance

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.0.0
    • Component/s: DStreams
    • Labels: None
    • Environment: Spark against object stores

    Description

      FileInputDStream.findNewFiles() does a globStatus with a filter that calls getFileStatus() on every file, then takes the output and calls listStatus() on it.

      This is going to suffer on object stores, where directory listing and getFileStatus calls are expensive. It is clearly already a problem, as the method has code to detect timeouts in the window and warn of problems.

      It should be possible to make this faster.
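
      To illustrate why the per-file calls matter, here is a toy model in plain Java. It is only a sketch: `CountingStore`, `Status`, and both scan methods are hypothetical names invented for this example, not Spark or Hadoop classes, and it assumes that a directory listing already returns each entry's modification time. It ignores the glob step and compares just the number of simulated remote calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Spark or Hadoop classes. */
final class ListingCostSketch {

    /** Hypothetical stand-in for the status record a listing returns. */
    static final class Status {
        final String path;
        final long modTime;
        Status(String path, long modTime) { this.path = path; this.modTime = modTime; }
    }

    /** Hypothetical in-memory store that counts simulated remote calls. */
    static final class CountingStore {
        private final Map<String, List<Status>> dirs = new HashMap<>();
        private final Map<String, Status> files = new HashMap<>();
        int remoteCalls = 0;

        void addFile(String dir, String name, long modTime) {
            Status s = new Status(dir + "/" + name, modTime);
            dirs.computeIfAbsent(dir, d -> new ArrayList<>()).add(s);
            files.put(s.path, s);
        }

        /** One simulated round trip per directory listing. */
        List<Status> listStatus(String dir) {
            remoteCalls++;
            return dirs.getOrDefault(dir, new ArrayList<>());
        }

        /** One simulated round trip per file probed. */
        Status getFileStatus(String path) {
            remoteCalls++;
            return files.get(path);
        }
    }

    /** List the directory, then probe every entry again for its modification time. */
    static List<String> scanWithPerFileProbe(CountingStore store, String dir, long newerThan) {
        List<String> fresh = new ArrayList<>();
        for (Status listed : store.listStatus(dir)) {
            Status probed = store.getFileStatus(listed.path); // extra call per file
            if (probed.modTime > newerThan) fresh.add(probed.path);
        }
        return fresh;
    }

    /** List the directory once and filter on the statuses the listing already returned. */
    static List<String> scanUsingListedStatus(CountingStore store, String dir, long newerThan) {
        List<String> fresh = new ArrayList<>();
        for (Status listed : store.listStatus(dir)) {
            if (listed.modTime > newerThan) fresh.add(listed.path);
        }
        return fresh;
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        for (int i = 0; i < 100; i++) store.addFile("/in", "part-" + i, i);

        List<String> a = scanWithPerFileProbe(store, "/in", 89);
        int probeCalls = store.remoteCalls;
        store.remoteCalls = 0;
        List<String> b = scanUsingListedStatus(store, "/in", 89);
        System.out.println(a.equals(b) + " " + probeCalls + " vs " + store.remoteCalls);
    }
}
```

      In this model both scans select the same ten files out of 100, but the first makes 101 simulated calls and the second makes 1. Whether a real filesystem client behaves the same way depends on what its listing call returns, so treat the numbers as an illustration of the cost shape rather than a measurement.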

            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Steve Loughran (stevel@apache.org)