Spark / SPARK-30900

FileStreamSource: Avoid reading compact metadata log twice if the query stops from compact batch and restarts

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      When a query restarts, it may start from a compaction batch for which a source metadata file already exists and must be read. One such case is when the previous run succeeded in reading from its inputs but did not finalize the batch for some reason.

      In this case, FileStreamSource reads the compact metadata file twice: once to retrieve all files to build the seen-file map, and once more to retrieve the entries in the batch. If the query has processed a huge number of inputs so far, the compact metadata file is considerably large, so reading it a second time adds unnecessary latency to processing the startup batch.

      This issue tracks the effort to address this case.
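
The double read described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual implementation: `ToyCompactLog`, `FileEntry`, `restart_naive`, and `restart_single_read` are hypothetical names invented for illustration, and the "read" merely copies an in-memory list while counting calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical record for one input file tracked by the source.
    path: str
    timestamp: int
    batch_id: int


@dataclass
class ToyCompactLog:
    """Hypothetical stand-in for a compact metadata log file; counts reads."""
    entries: list
    read_count: int = 0

    def read_all(self):
        # Simulates the expensive deserialization of the whole compact file.
        self.read_count += 1
        return list(self.entries)


def restart_naive(log, batch_id):
    # First read: every entry, to build the seen-file map.
    seen = {e.path: e.timestamp for e in log.read_all()}
    # Second read: the same file again, filtered to the restarted batch.
    batch = [e for e in log.read_all() if e.batch_id == batch_id]
    return seen, batch


def restart_single_read(log, batch_id):
    # Read once and derive both results from the same in-memory list.
    all_entries = log.read_all()
    seen = {e.path: e.timestamp for e in all_entries}
    batch = [e for e in all_entries if e.batch_id == batch_id]
    return seen, batch


# Six toy files spread over batches 0, 1, and 2 (two files per batch).
entries = [FileEntry(f"/in/file-{i}", 1000 + i, i // 2) for i in range(6)]
naive_log = ToyCompactLog(entries)
single_log = ToyCompactLog(entries)

naive_result = restart_naive(naive_log, batch_id=2)
single_result = restart_single_read(single_log, batch_id=2)
```

Both functions return the same seen-file map and batch entries; the only difference is how many times the toy log is read. In a real query the saving would scale with the size of the compact file, which is the latency concern raised in the description.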

            People

              Assignee: kabhwan Jungtaek Lim
              Reporter: kabhwan Jungtaek Lim