Details
- Type: Improvement
- Status: Resolved
- Priority: P2
- Resolution: Fixed
Description
The Beam job state transitions are ill-defined, which is a big problem for anything that relies on the values coming from JobAPI.GetStateStream.
I was hoping to find something like a state transition diagram in the docs, so that I could determine the start state, the terminal states, and the valid transitions, but I could not find one. The code reveals that the SDKs differ on the fundamentals:
Java InMemoryJobService:
- start state: STOPPED
- run, about to submit to the executor: STARTING
- run, actually running on the executor: RUNNING
- terminal states: DONE, FAILED, CANCELLED, DRAINED
Python AbstractJobServiceServicer / LocalJobServicer:
- start state: STARTING
- terminal states: DONE, FAILED, CANCELLED, STOPPED
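The divergence can be seen by comparing the two state sets directly. This is a minimal sketch; the sets below are transcribed from the lists above as plain strings, not imported from beam_job_api_pb2:

```python
# State sets as described above (hypothetical stand-ins for the real
# JobState enum values in beam_job_api.proto).
JAVA_START_STATE = "STOPPED"
JAVA_TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED", "DRAINED"}

PYTHON_START_STATE = "STARTING"
PYTHON_TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED", "STOPPED"}

# STOPPED is the *start* state in Java but a *terminal* state in Python,
# so a runner-agnostic client cannot interpret it safely.
assert JAVA_START_STATE in PYTHON_TERMINAL_STATES

# The terminal sets also disagree on DRAINED vs. STOPPED.
print(JAVA_TERMINAL_STATES ^ PYTHON_TERMINAL_STATES)
```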
I think it would be good to make Python work like Java, so that there is a distinct state between a job that has been prepared and one that has also been run.
It's hard to tell how far this problem has spread among the various runners. A simple step toward standardizing behavior would be to enumerate the terminal states in beam_job_api.proto, or to provide a utility function in each language for checking whether a state is terminal, so that each runner is not left to reimplement this logic.
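One possible shape for such a utility, sketched in Python. The `JobState` enum and `is_terminal` function below are hypothetical illustrations of the proposal, not existing Beam API; in practice the states would come from the JobState enum generated from beam_job_api.proto:

```python
import enum


class JobState(enum.Enum):
    # Mirrors the states named above; a real implementation would use
    # the generated JobState enum from beam_job_api.proto instead.
    STOPPED = "STOPPED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    DRAINED = "DRAINED"


# A single shared definition of terminal states, so individual runners
# do not reimplement (and diverge on) this logic.
TERMINAL_STATES = frozenset(
    {JobState.DONE, JobState.FAILED, JobState.CANCELLED, JobState.DRAINED}
)


def is_terminal(state: JobState) -> bool:
    """Return True if no further transitions are possible from `state`."""
    return state in TERMINAL_STATES
```

A client watching JobAPI.GetStateStream could then stop iterating when `is_terminal(state)` returns True, instead of hard-coding each runner's terminal set.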