Spark / SPARK-30915

FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      FileStreamSink.addBatch checks the latest batch ID before writing outputs, so that it can skip a batch that was already committed.

      Comparing the current batch with the latest batch ID is valid, but the getLatest() method is designed to return both the batch ID and the content, which means the latest metadata log file is read and deserialized. This introduces heavy latency when the latest batch is a compacted batch, because a compact file carries the entries of all earlier batches.

      We could instead just locate the metadata log file for the latest batch ID and do only the minimal check, without reading its content.
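
The proposal above can be illustrated with a minimal sketch, written in Python for brevity rather than in Spark's own Scala. It assumes that each metadata log file is named after its batch ID and that compacted batches carry a `.compact` suffix. The helper names `batch_id_of`, `latest_batch_id`, and `should_skip_batch` are hypothetical and are not Spark APIs; the actual change in Spark may differ.

```python
import os
import re

# Assumption: a compacted batch is stored as "<batchId>.compact",
# and a regular batch as "<batchId>".
COMPACT_SUFFIX = ".compact"
BATCH_NAME = re.compile(r"[0-9]+")


def batch_id_of(file_name):
    """Return the batch ID encoded in a log file name, or None if it is not a batch file."""
    stem = file_name
    if stem.endswith(COMPACT_SUFFIX):
        stem = stem[: -len(COMPACT_SUFFIX)]
    return int(stem) if BATCH_NAME.fullmatch(stem) else None


def latest_batch_id(log_dir):
    """Find the highest batch ID from the directory listing alone; no file is opened."""
    ids = [batch_id_of(name) for name in os.listdir(log_dir)]
    ids = [i for i in ids if i is not None]
    return max(ids) if ids else None


def should_skip_batch(log_dir, batch_id):
    """A batch is skipped when a log file for it, or for a later batch, already exists."""
    latest = latest_batch_id(log_dir)
    return latest is not None and batch_id <= latest
```

Because only the directory listing is consulted, the cost of the check does not grow with the size of a compact file.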

            People

              Assignee: Jungtaek Lim (kabhwan)
              Reporter: Jungtaek Lim (kabhwan)
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: