Spark / SPARK-17569

Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • Fix Version/s: 2.0.1, 2.1.0
    • None
    • None

    Description

      Structured Streaming's FileSource lists files to classify files as Offsets. Once this file list is committed to a metadata log for a batch, this file list is turned into a "Batch FileSource" Relation which acts as the source to the incremental execution.

      While this "Batch FileSource" Relation is being resolved, we re-check on the Driver that every single file exists. This check is very slow and redundant, since the files were already listed when the batch was created. We can simply skip the file existence check during execution.
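
      The idea can be sketched in a few lines. This is a toy model, not Spark's code: `FileRelation`, `resolve_relation`, `run_batch`, the `check_files_exist` flag, and the dict used as a metadata log are all hypothetical names chosen for illustration. It only shows that a batch built from an already-committed file list can skip the per-file existence check, while an ordinary batch read keeps it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRelation:
    """Hypothetical stand-in for the per-batch "Batch FileSource" relation."""
    paths: tuple


def resolve_relation(paths, check_files_exist=True, exists=os.path.exists):
    """Build a relation over `paths`, optionally verifying each path first.

    `exists` is injectable so callers can stub or count the checks.
    """
    if check_files_exist:
        missing = [p for p in paths if not exists(p)]
        if missing:
            raise FileNotFoundError(f"Path does not exist: {missing[0]}")
    return FileRelation(tuple(paths))


def run_batch(metadata_log, batch_id, listed_paths, exists=os.path.exists):
    """Commit the file listing for a batch, then resolve it without re-checking.

    The listing step already observed these files, so the existence check
    is skipped for the streaming batch (assumption of this sketch).
    """
    metadata_log[batch_id] = list(listed_paths)
    return resolve_relation(
        metadata_log[batch_id], check_files_exist=False, exists=exists
    )
```

      In this sketch the check stays on by default, so a plain batch read still fails fast on a bad path. Only the streaming path, which has just listed the files itself, opts out.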

          People

            brkyvz Burak Yavuz