Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20928

SPIP: Continuous Processing Mode for Structured Streaming

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.2.0
    • None
    • Structured Streaming

    Description

      Given the current Source API, the minimum possible latency for any record is bounded by the amount of time that it takes to launch a task. This limitation is a result of the fact that getBatch requires us to know both the starting and the ending offset, before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time + task launching time.

      For applications where latency is more important than exactly-once output however, it would be useful if processing could happen continuously. This would allow us to achieve fully pipelined reading and writing from sources such as Kafka. This kind of architecture would make it possible to process records with end-to-end latencies on the order of 1 ms, rather than the 10-100ms that is possible today.

      One possible architecture here would be to change the Source API to look like the following rough sketch:

        trait Epoch {
          def data: DataFrame
      
          /** The exclusive starting position for `data`. */
          def startOffset: Offset
      
          /** The inclusive ending position for `data`.  Incrementally updated during processing, but not complete until execution of the query plan in `data` is finished. */
          def endOffset: Offset
        }
      
        def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: Limits): Epoch
      

      The above would allow us to build an alternative implementation of StreamExecution that processes continuously with much lower latency and only stops processing when needing to reconfigure the stream (either due to a failure or a user requested change in parallelism.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            joseph.torres Jose Torres
            marmbrus Michael Armbrust
            Votes:
            24 Vote for this issue
            Watchers:
            119 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment