Spark / SPARK-30915

FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

      Description

      FileStreamSink.addBatch checks the latest batch ID before writing output, so that it can skip a batch that has already been committed.

      Comparing the current batch ID with the latest batch ID is valid, but the getLatest() method is designed to return both the batch ID and the log content, which means the latest metadata log file is read and deserialized. This introduces heavy latency when the latest batch is a compacted batch, because a compacted log file aggregates the entries of earlier batches and can be large.

      Instead, we could just locate the metadata log file for the latest batch ID and do the minimal check without reading its content.
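
      The idea can be illustrated with a minimal sketch, written in Python rather than Spark's Scala and using a local directory instead of a Hadoop file system. It is not the actual patch: the function names (parse_batch_id, latest_batch_id, should_skip_batch) are hypothetical, and the sketch assumes log files are named after their batch ID, with compacted batches carrying a ".compact" suffix. The latest batch ID is derived from file names alone, so no log file is opened or deserialized.

      ```python
      import os
      from typing import Optional

      # Assumption: a compacted batch is stored as "<batchId>.compact".
      COMPACT_SUFFIX = ".compact"


      def parse_batch_id(file_name: str) -> Optional[int]:
          """Return the batch ID encoded in a log file name, or None for other files."""
          stem = file_name
          if file_name.endswith(COMPACT_SUFFIX):
              stem = file_name[: -len(COMPACT_SUFFIX)]
          # Only plain ASCII digits count as a batch ID; temp/checksum files are ignored.
          if stem.isascii() and stem.isdigit():
              return int(stem)
          return None


      def latest_batch_id(log_dir: str) -> Optional[int]:
          """Find the highest batch ID by listing file names only (no file is read)."""
          ids = [parse_batch_id(name) for name in os.listdir(log_dir)]
          return max((i for i in ids if i is not None), default=None)


      def should_skip_batch(log_dir: str, batch_id: int) -> bool:
          """A batch is skipped when it is not newer than the latest committed batch."""
          latest = latest_batch_id(log_dir)
          return latest is not None and batch_id <= latest
      ```

      With this shape, the cost of the check depends on the number of files in the log directory, not on the size of the latest (possibly compacted) log file.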

              People

              • Assignee: Jungtaek Lim (kabhwan)
              • Reporter: Jungtaek Lim (kabhwan)