SPARK-17569

Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • Fix Version/s: 2.0.1, 2.1.0
    • None
    • None

    Description

      Structured Streaming's FileSource lists files and classifies them as Offsets. Once the file list for a batch is committed to the metadata log, it is turned into a "Batch FileSource" Relation, which acts as the source for the incremental execution.

      While this "Batch FileSource" Relation is being resolved, we re-check on the Driver that every single file exists. This takes a horrible amount of time and is a total waste, since the files were just listed. We can simply skip the file existence check during execution.
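
The flow described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Spark's actual code: the names `resolve_batch_relation`, `run_batch`, and the `check_files_exist` flag are hypothetical and were chosen only for this example. The sketch lists files, commits the list to an in-memory stand-in for the metadata log, and then resolves a batch relation with or without the per-file existence check.

```python
import os
import tempfile


def resolve_batch_relation(paths, check_files_exist=True):
    """Hypothetical stand-in for resolving a batch file relation on the driver."""
    checks = 0
    if check_files_exist:
        for path in paths:
            checks += 1  # one filesystem call per file
            if not os.path.exists(path):
                raise FileNotFoundError(path)
    return {"paths": list(paths), "existence_checks": checks}


def run_batch(source_dir, metadata_log, check_files_exist):
    """List files, commit the list to the log, then resolve the batch relation."""
    # Step 1: list the files that make up this batch.
    batch_files = sorted(
        os.path.join(source_dir, name) for name in os.listdir(source_dir)
    )
    # Step 2: commit the file list (a plain Python list stands in for the metadata log).
    metadata_log.append(batch_files)
    # Step 3: build the batch relation from the committed list.
    return resolve_batch_relation(metadata_log[-1], check_files_exist)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as source_dir:
        for i in range(3):
            with open(os.path.join(source_dir, f"part-{i}.txt"), "w") as handle:
                handle.write("data")
        log = []
        rechecked = run_batch(source_dir, log, check_files_exist=True)
        skipped = run_batch(source_dir, log, check_files_exist=False)
        print(rechecked["existence_checks"], skipped["existence_checks"])
```

With the check enabled, the cost of resolution grows with the number of files in the batch; with it disabled, resolution relies on the list that was already committed to the log.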

          People

            Assignee: brkyvz Burak Yavuz
            Reporter: brkyvz Burak Yavuz
