Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18516

Separate instantaneous state from progress performance statistics

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 2.1.0
    • Structured Streaming
    • None

    Description

      There are two types of information that you want to be able to extract from a running query: instantaneous status and metrics about the performance as make progress in query processing.

      Today, these are conflated in a single StreamingQueryStatus object. The downside to this approach is that a user now needs to reason about what state the query is in anytime they retrieve a status object. Fields like statusMessage don't appear in updates that come from listener bus. Simlarly, inputRate/processingRate statistics are usually 0 when you retrieve a status object from the query itself.

      I propose we make the follow changes:

      • Make status only report instantaneous things, such as if data is available or a human readable message about what phase we are currently in.
      • Have a separate progress message that we report for each trigger with the other performance information that lives in status today. You should be able to easily retrieve a configurable number of the most recent progress messages instead of just the most recent.

      While we are making these changes, I propose that we also change id to be a globally unique identifier, rather than a JVM unique one. Without this its hard to correlate performance across restarts.

      Attachments

        Issue Links

          Activity

            People

              marmbrus Michael Armbrust
              marmbrus Michael Armbrust
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: