  1. Apache Airflow
  2. AIRFLOW-6648

Timeout Feature - Provide a statistical solution for long-running/stuck jobs and take appropriate action


    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: aws, DAG, database, operators
    • Labels:
      None
    • Environment:
      AWS Linux AMI - Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1027-aws x86_64)

      Description

      Sometimes, across different types of tasks/jobs, Airflow jobs/tasks get stuck while they are in the running state.
      Such issues leave the pipeline stuck for no apparent reason and stall other jobs/tasks, which is a disaster when it happens in production.

      This improvement aims not only to improve upon the TIMEOUT logic already in Airflow, but also to make it more functional and automated.

      Diagrammatic explanation of the solution -

      Detailed Theoretical Explanation - 

      With increasing data volume and task/job complexity, on top of the increasing load, memory leaks, stuck jobs, and infrastructure issues become more likely and can produce unwanted results.
      On some days there may be more data, resulting in a steep jump in the duration of the job; otherwise, the growth is expected to be gradual.
      Sometimes jobs get stuck for various reasons and require termination followed by a restart.
      So, we are trying to build logic that will automatically decide whether to

      • Terminate the job
      • Terminate and restart
      • Terminate and mark as a failure so that downstream jobs don't get triggered
      • Take no action and inform DevOps about the issue (manual action)

      So, I would like to know what would be, statistically, an effective way to achieve the above outcomes.
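
A minimal sketch of how such a decision could be made is given here, purely as an illustration and not as the proposed implementation. It assumes that the durations of past successful runs of a job are available, and that a running job is compared against the mean plus a multiple of the standard deviation of that history. The function name `decide_action`, the `Action` enum, and the thresholds `warn_sigmas`, `kill_sigmas`, and `min_sigma_frac` are all hypothetical; none of them exist in Airflow. A plain "terminate" outcome is left out for brevity, and `CONTINUE` is added to cover a job that is still within its normal range.

```python
from enum import Enum
from statistics import mean, stdev
from typing import Sequence


class Action(Enum):
    """Hypothetical outcomes, loosely mirroring the list above."""
    CONTINUE = "continue"                            # job is within its normal range
    NOTIFY_DEVOPS = "notify_devops"                  # take no action, inform DevOps
    TERMINATE_AND_RESTART = "terminate_and_restart"
    TERMINATE_AND_FAIL = "terminate_and_fail"        # downstream jobs are not triggered


def decide_action(
    history_secs: Sequence[float],
    elapsed_secs: float,
    retries_left: int,
    warn_sigmas: float = 2.0,     # assumed threshold, not from the ticket
    kill_sigmas: float = 4.0,     # assumed threshold, not from the ticket
    min_sigma_frac: float = 0.05,  # assumed floor on the spread, as a fraction of the mean
) -> Action:
    """Classify a running job by how far it has exceeded its usual duration."""
    if len(history_secs) < 2:
        # Not enough history to estimate a spread, so do nothing.
        return Action.CONTINUE

    mu = mean(history_secs)
    # Floor the spread so that a very regular job is not killed for a tiny overrun.
    sigma = max(stdev(history_secs), min_sigma_frac * mu)

    warn_at = mu + warn_sigmas * sigma
    kill_at = mu + kill_sigmas * sigma

    if elapsed_secs <= warn_at:
        return Action.CONTINUE
    if elapsed_secs <= kill_at:
        return Action.NOTIFY_DEVOPS
    # Far beyond the usual duration: terminate, and restart only if retries remain.
    if retries_left > 0:
        return Action.TERMINATE_AND_RESTART
    return Action.TERMINATE_AND_FAIL


# Example: a job that usually takes about ten minutes has been running for 700 s.
history = [600, 620, 580, 610, 590]
action = decide_action(history, elapsed_secs=700, retries_left=1)
```

Because the thresholds are derived from the job's own history, a gradual growth in duration moves them along with it, while a sudden steep jump or a stuck job crosses them. The choice of mean and standard deviation is only one option; a percentile-based threshold could be swapped in the same way.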

      Let's consider two jobs, X and Y.

      Job-related info -

      I was then thinking of adding a new table, which would be structured as -

      Derived table -

      (The above example is theoretical, and the actual implementation might differ.)

      LIMITATIONS -

      1. For now, we have only tested the above on EMR (personal use case).
      2. Testing is pending for Databricks (personal use case).

      Please do suggest any other services where this is needed or can be used.

        Attachments

        1. image2019-3-25_12-33-57.png
          49 kB
          Golokesh Patra
        2. image-2020-01-27-17-07-51-822.png
          17 kB
          Golokesh Patra
        3. image-2020-01-27-17-08-09-867.png
          17 kB
          Golokesh Patra
        4. image-2020-01-27-17-08-33-088.png
          9 kB
          Golokesh Patra
        5. image-2020-01-27-17-22-07-433.png
          60 kB
          Golokesh Patra

            People

            • Assignee:
              golokeshpatra.patra@gmail.com Golokesh Patra
              Reporter:
              golokeshpatra.patra@gmail.com Golokesh Patra
            • Votes:
              0
              Watchers:
              2

                Time Tracking

                Estimated:
                Original Estimate - 336h
                Remaining:
                Remaining Estimate - 336h
                Logged:
                Time Spent - Not Specified