  Apache Airflow / AIRFLOW-4910

KubernetesExecutor - KubernetesJobWatcher can silently fail


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.3
    • Fix Version/s: None
    • Component/s: executors
    • Labels:

      Description

      After not monitoring Airflow for a while, I noticed that tasks had not been running for several days.

      My setup: the scheduler and webserver run in a single pod, using the KubernetesExecutor. There are 4 different DAGs, none of them very large: one runs once per day, two every 30 minutes, and one every 2 minutes.

      Airflow had log messages such as these:

      {{jobs.py:1144}} INFO - Figuring out tasks to run in Pool(name=None) with 128 open slots and 179 task instances in queue
      {{jobs.py:1210}} DEBUG - Not handling task ('example_python_operator', 'print_the_context', datetime.datetime(2019, 6, 7, 0, 0, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) as the executor reports it is running

      ... and a bit further down:

      {{base_executor.py:124}} DEBUG - 32 running task instances

      In the Kubernetes cluster, there were no pods created by Airflow (they'd all finished and been deleted).

      After digging into the logs around the time jobs stopped progressing, I noticed that the KubernetesJobWatcher stopped logging pod state changes at that point, even though I could still see log messages for new pods being created.

      It's hard to tell why this happened: if the subprocess running the job watcher had died, the heartbeat should have detected it. If the Watch had thrown an exception, there should have been logs (there weren't), and the watcher should then have restarted.
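
      To be clear about what I mean by "detected by the heartbeat", here is a minimal sketch (not Airflow's actual code; all names are made up) of a parent loop that checks whether the watcher subprocess is still alive and restarts it if it has exited:

{code:python}
# Minimal sketch, not Airflow's actual code -- all names here are made up.
# The parent's heartbeat loop checks whether the watcher subprocess is still
# alive and restarts it if it has exited for any reason.
import multiprocessing
import time


def watch_pods():
    """Stand-in for the pod-watching loop."""
    while True:
        time.sleep(1)


def ensure_watcher_alive(watcher):
    """Restart the watcher if its process has died."""
    if watcher.is_alive():
        return watcher
    print("watcher exited with code %s, restarting" % watcher.exitcode)
    replacement = multiprocessing.Process(target=watch_pods, daemon=True)
    replacement.start()
    return replacement


if __name__ == "__main__":
    watcher = multiprocessing.Process(target=watch_pods, daemon=True)
    watcher.start()
    while True:  # heartbeat loop
        watcher = ensure_watcher_alive(watcher)
        time.sleep(10)
{code}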

      I have a few theories as to what might have happened:

      1. The Watch hung indefinitely - although I can't see any issues against the Kubernetes python client that suggest other people have had this issue
      2. The KubernetesJobWatcher died, but the heartbeat was not functioning correctly
      3. The Watcher experienced a large gap between watch requests, meaning some relevant events were "lost" and the respective tasks were left in the "running" state (see the sketch after this list)
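
      To make theory 3 concrete, here is a hedged sketch of the general shape of such a watch loop - not the actual KubernetesJobWatcher code, and the namespace and label selector are made up. The last seen resourceVersion is carried from one watch request to the next; if the API server reports that version as expired (HTTP 410), any events in the gap are gone and the watcher has to resync and reconcile its state:

{code:python}
# Hedged sketch of a resilient pod watch loop, not Airflow's KubernetesJobWatcher.
# The namespace and label selector are assumptions for illustration only.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def watch_airflow_pods(namespace="default", label_selector="airflow-worker"):
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    resource_version = None

    while True:  # outer loop: keep re-opening the watch
        try:
            stream = watch.Watch().stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                label_selector=label_selector,
                resource_version=resource_version,
                timeout_seconds=60,  # bounded requests rather than one indefinite hang
            )
            for event in stream:
                pod = event["object"]
                resource_version = pod.metadata.resource_version
                print(event["type"], pod.metadata.name, pod.status.phase)
        except ApiException as exc:
            if exc.status == 410:  # resourceVersion too old: events in the gap are lost
                resource_version = None  # resync from scratch; state must be reconciled
            else:
                raise
{code}

      Depending on the client version, the expiry may also surface as an ERROR event inside the stream rather than as an exception, which would be another way for events to be dropped without an obvious log message.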

      Unfortunately I don't have the answers, so I'm posting this in the hope that someone has some additional insight.

      As a side note, I'm using version 9.0.0 of the Kubernetes Python client.

      My only suggestion for a fix is to periodically check which pods are actually running, reconcile that against the "running" queue in the executor, and maybe force-restart the job watcher if the state has diverged.
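
      As a rough illustration of that suggestion - a sketch only, not a patch against Airflow; running_task_pods stands in for the executor's internal "running" queue keyed by pod name, and the namespace and label selector are assumptions - something like this could list the worker pods and return the task keys that no longer have a live pod behind them:

{code:python}
# Rough illustration of the reconciliation idea, not a patch against Airflow.
# `running_task_pods` stands in for the executor's "running" queue, keyed by
# pod name; the namespace and label selector are assumptions.
from kubernetes import client, config


def find_orphaned_tasks(running_task_pods, namespace="default",
                        label_selector="airflow-worker"):
    """Return task keys the executor thinks are running but that have no live pod."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pod_list = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    live_pods = {
        pod.metadata.name
        for pod in pod_list.items
        if pod.status.phase in ("Pending", "Running")
    }
    return {
        task_key
        for pod_name, task_key in running_task_pods.items()
        if pod_name not in live_pods
    }
{code}

      If that set is ever non-empty, the watcher has probably missed events, and the executor could force-restart it and re-queue or fail the affected tasks.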

        Attachments

          Activity

            People

            • Assignee: Daniel Imberman
            • Reporter: Sam Stephens
            • Votes: 0
            • Watchers: 2
