Details
- Improvement
- Status: Resolved
- Blocker
- Resolution: Fixed
- None
- None
Description
There are two types of information that you want to be able to extract from a running query: its instantaneous status, and metrics about its performance as it makes progress in query processing.
Today, these are conflated in a single StreamingQueryStatus object. The downside to this approach is that a user needs to reason about what state the query is in any time they retrieve a status object. Fields like statusMessage don't appear in updates that come from the listener bus. Similarly, inputRate/processingRate statistics are usually 0 when you retrieve a status object from the query itself.
I propose we make the following changes:
- Make status report only instantaneous things, such as whether data is available or a human-readable message about what phase we are currently in.
- Have a separate progress message that we report for each trigger, carrying the other performance information that lives in status today. You should be able to easily retrieve a configurable number of the most recent progress messages instead of just the most recent one.
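As a rough sketch of the split proposed above (not Spark's actual implementation), the following Python models a query handle that exposes an instantaneous status and a bounded history of per-trigger progress records. All class, field, and parameter names here (`QueryStatus`, `QueryProgress`, `SketchQuery`, `max_progress`) are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStatus:
    """Instantaneous state only: no rates or other per-trigger metrics."""
    message: str
    is_data_available: bool


@dataclass(frozen=True)
class QueryProgress:
    """Metrics for one completed trigger."""
    trigger_id: int
    num_input_rows: int
    input_rows_per_second: float
    processed_rows_per_second: float


class SketchQuery:
    """Hypothetical query handle that keeps status and progress separate."""

    def __init__(self, max_progress=100):
        # Bounded buffer: the oldest progress records are dropped automatically.
        self._progress = deque(maxlen=max_progress)
        self._status = QueryStatus("Initializing", False)

    def set_status(self, message, is_data_available):
        self._status = QueryStatus(message, is_data_available)

    def report_progress(self, progress):
        self._progress.append(progress)

    @property
    def status(self):
        return self._status

    @property
    def recent_progress(self):
        return list(self._progress)

    @property
    def last_progress(self):
        return self._progress[-1] if self._progress else None


query = SketchQuery(max_progress=2)
query.set_status("Processing new data", True)
for i in range(3):
    query.report_progress(QueryProgress(i, 10 * (i + 1), 100.0, 80.0))
```

With `max_progress=2`, only the two most recent triggers remain in `query.recent_progress`, while `query.status` never carries rate fields that could be stale or zero.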
While we are making these changes, I propose that we also change id to be a globally unique identifier, rather than a JVM-unique one. Without this, it's hard to correlate performance across restarts.
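One way to get an id that survives restarts is to generate a UUID on first start and persist it next to the query's durable state. The sketch below assumes the id is stored in a file under a checkpoint directory; that storage location, the file name `query_id`, and the helper `load_or_create_query_id` are assumptions made for illustration, not something this issue specifies.

```python
import os
import tempfile
import uuid


def load_or_create_query_id(checkpoint_dir):
    """Return the id stored in checkpoint_dir, creating one on first start."""
    path = os.path.join(checkpoint_dir, "query_id")  # hypothetical file name
    if os.path.exists(path):
        # Restart: reuse the id written by the earlier run.
        with open(path) as f:
            return uuid.UUID(f.read().strip())
    # First start: generate a globally unique id and persist it.
    query_id = uuid.uuid4()
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(path, "w") as f:
        f.write(str(query_id))
    return query_id


checkpoint = tempfile.mkdtemp()
first_start = load_or_create_query_id(checkpoint)
after_restart = load_or_create_query_id(checkpoint)
```

Because the second call reads the persisted value, `first_start` and `after_restart` are equal, which is what lets progress records from different runs of the same query be correlated.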
Issue Links
- supercedes
- SPARK-18474 Add StreamingQuery.status in python
- Closed
- links to