Sometimes, across different types of tasks/jobs,
one might encounter issues where Airflow jobs/tasks get stuck in the running state.
Such issues leave the pipeline stalled for no apparent reason, blocking other jobs/tasks, which can be disastrous when it happens in production.
This particular improvement aims not only to build on the TIMEOUT logic already present in Airflow, but to make it more functional and automated.
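For context, Airflow's existing timeout logic is static: you set a fixed per-task `execution_timeout` and a per-run `dagrun_timeout` up front. A minimal sketch of those built-in knobs (the DAG id, task, and durations below are purely illustrative):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Static timeouts: fixed values chosen ahead of time, regardless of
# how the job's runtime actually evolves over the weeks.
with DAG(
    dag_id="example_static_timeouts",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    dagrun_timeout=timedelta(hours=2),  # whole DAG run is failed after 2h
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="sleep 5",
        execution_timeout=timedelta(minutes=30),  # task killed after 30m
    )
```

The limitation this proposal targets is exactly that these values are hard-coded: they neither adapt to gradual growth in runtime nor distinguish a data spike from a genuinely stuck job.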
Diagrammatic Explanation of the Solution -
Detailed Theoretical Explanation -
As data volume and task/job complexity grow, alongside the increasing load, the chances of memory leaks, stuck jobs, infrastructure issues, etc. also rise, producing unwanted results.
On some days there may be more data, causing a steep jump in the job's duration; otherwise, the growth in duration is expected to be gradual.
And sometimes, jobs get stuck because of various issues and often require termination followed by a restart.
So, we are trying to build logic that will automatically decide whether to:
- Terminate the job
- Terminate and restart
- Terminate and mark as a failure so that downstream jobs don't get triggered
- Take no action and inform DevOps about the issue (manual action)
So the question is: statistically, what would be an effective way to achieve the above outcomes?
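As a starting point, the decision above can be sketched as a z-score rule over a job's historical runtimes. Everything here is an assumption for illustration, not part of the original design: the thresholds (`soft_k`, `hard_k`), the restart budget, and the action names are all placeholders to be tuned per job.

```python
from statistics import mean, stdev

def choose_action(history, current_runtime, restarts_so_far,
                  max_restarts=1, soft_k=2.0, hard_k=4.0):
    """Decide what to do with a still-running job.

    history: past successful runtimes in seconds (illustrative input).
    Returns one of: NO_ACTION, TERMINATE_AND_RESTART,
    TERMINATE_AND_FAIL, ALERT_DEVOPS.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        sigma = max(mu * 0.05, 1.0)  # guard against zero variance

    z = (current_runtime - mu) / sigma
    if z <= soft_k:
        # Within expected gradual growth: leave the job alone.
        return "NO_ACTION"
    if z <= hard_k:
        # Moderately anomalous: likely a data spike or a transient
        # stuck executor, so a bounded restart is worth trying.
        if restarts_so_far < max_restarts:
            return "TERMINATE_AND_RESTART"
        # Restart budget exhausted: fail and block downstream jobs.
        return "TERMINATE_AND_FAIL"
    # Far beyond anything seen before: automated handling may be
    # unsafe, so hand it to a human.
    return "ALERT_DEVOPS"
```

A median/MAD-based threshold would be more robust to outliers in the history; the mean/stdev version is shown only because it is the simplest to reason about.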
Let's consider two jobs, X and Y.
Job-related info -
Then I was thinking of adding a new table, which would be structured as -
Derived table-
(The above example is theoretical; the actual implementation might differ.)
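Since the actual table layouts are not shown above, here is one hypothetical way the derived per-job stats table could be computed from a raw run-history table. All column names, rows, and values below are illustrative assumptions, not the real schema:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical run-history rows: (job_name, run_date, duration_seconds, status)
run_history = [
    ("X", "2024-01-01", 120, "SUCCESS"),
    ("X", "2024-01-02", 130, "SUCCESS"),
    ("X", "2024-01-03", 125, "SUCCESS"),
    ("Y", "2024-01-01", 300, "SUCCESS"),
    ("Y", "2024-01-02", 320, "SUCCESS"),
]

def derive_stats(rows):
    """Aggregate successful runs into a derived per-job stats table."""
    by_job = defaultdict(list)
    for job, _date, duration, status in rows:
        # Only successful runs should inform the expected-runtime baseline.
        if status == "SUCCESS":
            by_job[job].append(duration)

    derived = {}
    for job, durations in by_job.items():
        derived[job] = {
            "runs": len(durations),
            "avg_duration": mean(durations),
            "std_duration": stdev(durations) if len(durations) > 1 else 0.0,
            "max_duration": max(durations),
        }
    return derived
```

In practice this aggregation would live in the scheduler's metadata store (or be recomputed periodically), and the decision logic would read the derived row for the job it is monitoring.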
LIMITATIONS -
- For now, we have only tested the above on EMR (personal use case).
- Testing is pending for Databricks (personal use case).
Please suggest any other services where this needs to be, or can be, used.