SPARK-30866

FileStreamSource: Cache fetched list of files beyond maxFilesPerTrigger as unread files


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      FileStreamSource fetches the list of available files on every batch, which is a costly operation.

      (For example, listing leaf files across 95 paths containing 674,811 files took around 5 seconds - and that was on a local filesystem, not even an HDFS path.)

      If "maxFilesPerTrigger" is not set, Spark consumes all of the fetched files in a single batch, so it clearly has to fetch again for the next micro-batch.

      If "latestFirst" is true (regardless of "maxFilesPerTrigger"), the set of files to process has to be re-evaluated on every batch, so Spark also has to fetch per micro-batch.

      Except for the above cases (in short, when "maxFilesPerTrigger" is set and "latestFirst" is false), the files to process are consumed "continuously": we can cache the fetched list of files as unread files and keep consuming from the list until it is exhausted.
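
      Below is a minimal standalone sketch of the idea, not the actual FileStreamSource internals - the class name, the listFiles callback, and the queue-based cache are all hypothetical, illustrating only the caching strategy described above:

{code:scala}
import scala.collection.mutable

// Hypothetical sketch (not the real FileStreamSource code): keep
// fetched-but-unread files in memory and drain them across batches, so the
// expensive listing runs only when the cached list is exhausted.
class CachedFileList(
    listFiles: () => Seq[String],  // the expensive listing operation
    maxFilesPerTrigger: Int,
    latestFirst: Boolean) {

  // Files fetched on a previous trigger but not yet handed to a batch.
  private val unreadFiles = mutable.Queue.empty[String]

  /** Returns the files to process in the next micro-batch. */
  def nextBatch(): Seq[String] = {
    // Caching applies only when latestFirst is false; with latestFirst the
    // candidate set must be refreshed on every batch, so always re-list then.
    if (latestFirst || unreadFiles.isEmpty) {
      unreadFiles.clear()
      unreadFiles ++= listFiles()
    }
    Seq.fill(math.min(maxFilesPerTrigger, unreadFiles.size)) {
      unreadFiles.dequeue()
    }
  }
}
{code}

      Under this sketch, a source configured with maxFilesPerTrigger = 100 over a listing of 10,000 files would pay the listing cost roughly once per 100 batches instead of on every batch. (In user code, the two options correspond to .option("maxFilesPerTrigger", "100") and .option("latestFirst", "false") on spark.readStream.)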


          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
