Apache Airflow / AIRFLOW-1494

Backfill job failed because a new retry is launched while a valid job is still running

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: Airflow 1.8
    • Fix Version/s: None
    • Component/s: backfill
    • Labels:
      None
    • Environment:
      ubuntu

      Description

I have a Spark job, wrapped in a bash command, to run.
From the failure log, I found that Airflow tried to rerun the job while the job was still running. Then a series of strange things happened and my job finally failed. Logs below:

[2017-08-07 23:24:35,903] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:35,903] {bash_operator.py:94} INFO - 17/08/07 23:24:35 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:36,904] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:36,904] {bash_operator.py:94} INFO - 17/08/07 23:24:36 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:37,905] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:37,904] {bash_operator.py:94} INFO - 17/08/07 23:24:37 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:38,906] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:38,906] {bash_operator.py:94} INFO - 17/08/07 23:24:38 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:38,947] {cli.py:391} INFO - Loading pickle id 148
[2017-08-07 23:24:39,020] {base_task_runner.py:112} INFO - Running: ['bash', '-c', u'airflow run generated_daily submit_operator 2017-08-05T00:00:00 --pickle 148 --job_id 15372 --raw']
[2017-08-07 23:24:39,426] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,426] {__init__.py:57} INFO - Using executor CeleryExecutor
[2017-08-07 23:24:39,495] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,494] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-08-07 23:24:39,519] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,519] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2017-08-07 23:24:39,710] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,710] {cli.py:391} INFO - Loading pickle id 148
[2017-08-07 23:24:39,772] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,772] {models.py:1120} INFO - Dependencies not met for <TaskInstance: generated_daily.submit_operator 2017-08-05 00:00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2017-08-07 20:55:12.727910.
[2017-08-07 23:24:39,777] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,777] {models.py:1120} INFO - Dependencies not met for <TaskInstance: generated_daily.submit_operator 2017-08-05 00:00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2017-08-07 23:24:39,907] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:39,907] {bash_operator.py:94} INFO - 17/08/07 23:24:39 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:40,908] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:40,908] {bash_operator.py:94} INFO - 17/08/07 23:24:40 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:41,909] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:41,909] {bash_operator.py:94} INFO - 17/08/07 23:24:41 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:42,911] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:42,910] {bash_operator.py:94} INFO - 17/08/07 23:24:42 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:43,912] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:43,912] {bash_operator.py:94} INFO - 17/08/07 23:24:43 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:44,120] {jobs.py:2148} WARNING - Recorded pid 116446 is not a descendant of the current pid 33416
[2017-08-07 23:24:44,914] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:44,913] {bash_operator.py:94} INFO - 17/08/07 23:24:44 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:45,914] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:45,914] {bash_operator.py:94} INFO - 17/08/07 23:24:45 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
[2017-08-07 23:24:46,032] {jobs.py:2155} WARNING - State of this instance has been externally set to up_for_retry. Taking the poison pill. So long.
[2017-08-07 23:24:46,081] {helpers.py:220} WARNING - Terminating descendant processes of ['/usr/bin/python /usr/local/bin/airflow run generated_daily submit_operator 2017-08-05T00:00:00 --pickle 148 --job_id 15015 --raw'] PID: 116446
[2017-08-07 23:24:46,081] {helpers.py:224} WARNING - Terminating descendant process ['bash', '/tmp/airflowtmpCFeIBn/submit_operatorGbO01R'] PID: 116496
[2017-08-07 23:24:46,087] {helpers.py:224} WARNING - Terminating descendant process ['/bin/sh', '/mnt/data/firework-0.2.0/bin/spark-submit', '--spark-version', '1.6.2', '--master', 'yarn-cluster', '--num-executors', '128', '--class', 'com.hulu.reco.metrics.middlelayer.BatchApplication', '--conf', 'spark.date=20170805', '--conf', 'spark.yarn.maxAppAttempts=1', '--conf', 'spark.driver.memory=8g', '--conf', 'spark.executor.memory=36g', '--conf', 'spark.executor.cores=8', '--conf', 'spark.user.javahome.enabled=true', '--conf', 'spark.user.javahome.path=/usr/lib/jvm/hulu-oracle-jdk8', '--conf', 'spark.output=hdfs://warehousestore/user/pcdm/generated_prod/20170805', '--queue', 'spark', '/home/deploy/middlelayer-1.0-SNAPSHOT-jar-with-dependencies.jar'] PID: 116498
[2017-08-07 23:24:46,091] {helpers.py:224} WARNING - Terminating descendant process ['/usr/java/jdk8/jdk1.8.0_92-1//bin/java', '-cp', '/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/conf/:/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.6.0-cdh5.7.3-201612201803.jar:/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/lib_managed/jars/datanucleus-core-3.2.10.jar:/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/mnt/data/firework-0.2.0/.firework_cache/spark/els/1.6.2/b72e04b6120cd0a431aae3195db97706/hadoop-conf/', 'org.apache.spark.deploy.SparkSubmit', '--master', 'yarn-cluster', '--conf', 'spark.executor.memory=36g', '--conf', 'spark.driver.memory=8g', '--conf', 'spark.user.javahome.path=/usr/lib/jvm/hulu-oracle-jdk8', '--conf', 'spark.output=hdfs://warehousestore/user/pcdm/generated_prod/20170805', '--conf', 'spark.yarn.maxAppAttempts=1', '--conf', 'spark.executor.cores=8', '--conf', 'spark.date=20170805', '--conf', 'spark.user.javahome.enabled=true', '--class', 'com.hulu.reco.metrics.middlelayer.BatchApplication', '--num-executors', '128', '--queue', 'spark', '/home/deploy/middlelayer-1.0-SNAPSHOT-jar-with-dependencies.jar'] PID: 116761
[2017-08-07 23:24:46,096] {helpers.py:231} WARNING - Waiting up to 5s for processes to exit...
[2017-08-07 23:24:46,099] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,099] {bash_operator.py:94} INFO - 17/08/07 23:24:46 INFO util.ShutdownHookManager: Shutdown hook called
[2017-08-07 23:24:46,100] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,100] {bash_operator.py:94} INFO - 17/08/07 23:24:46 INFO util.ShutdownHookManager: Deleting directory /mnt/data/tmp/spark-b5f3c795-7ea2-4ef1-a47c-2507b0747580
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,118] {bash_operator.py:97} INFO - Command exited with return code -15
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,118] {models.py:1417} ERROR - Bash command failed
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: Traceback (most recent call last):
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: result = task_copy.execute(context=context)
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/operators/bash_operator.py", line 100, in execute
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: raise AirflowException("Bash command failed")
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: AirflowException: Bash command failed
[2017-08-07 23:24:46,121] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,120] {models.py:1433} INFO - Marking task as UP_FOR_RETRY
[2017-08-07 23:24:46,123] {helpers.py:234} WARNING - Done waiting
[2017-08-07 23:24:46,131] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:46,130] {models.py:1462} ERROR - Bash command failed
[2017-08-07 23:24:46,131] {base_task_runner.py:95} INFO - Subtask: Traceback (most recent call last):
[2017-08-07 23:24:46,131] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/bin/airflow", line 28, in <module>
[2017-08-07 23:24:46,131] {base_task_runner.py:95} INFO - Subtask: args.func(args)
[2017-08-07 23:24:46,131] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 422, in run
[2017-08-07 23:24:46,132] {base_task_runner.py:95} INFO - Subtask: pool=args.pool,
[2017-08-07 23:24:46,132] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
[2017-08-07 23:24:46,132] {base_task_runner.py:95} INFO - Subtask: result = func(*args, **kwargs)
[2017-08-07 23:24:46,132] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
[2017-08-07 23:24:46,132] {base_task_runner.py:95} INFO - Subtask: result = task_copy.execute(context=context)
[2017-08-07 23:24:46,133] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/operators/bash_operator.py", line 100, in execute
[2017-08-07 23:24:46,133] {base_task_runner.py:95} INFO - Subtask: raise AirflowException("Bash command failed")
[2017-08-07 23:24:46,133] {base_task_runner.py:95} INFO - Subtask: airflow.exceptions.AirflowException: Bash command failed
[2017-08-07 23:24:51,034] {jobs.py:2083} INFO - Task exited with return code 1
[2017-08-07 23:33:27,872] {cli.py:391} INFO - Loading pickle id 148
[2017-08-07 23:33:27,940] {base_task_runner.py:112} INFO - Running: ['bash', '-c', u'airflow run generated_daily submit_operator 2017-08-05T00:00:00 --pickle 148 --job_id 15376 --raw']
[2017-08-07 23:33:28,302] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:33:28,301] {__init__.py:57} INFO - Using executor CeleryExecutor
[2017-08-07 23:33:28,359] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:33:28,358] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-08-07 23:33:28,378] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:33:28,378] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2017-08-07 23:33:28,543] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:33:28,543] {cli.py:391} INFO - Loading pickle id 148
[2017-08-07 23:33:28,601] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:33:28,601] {models.py:1120} INFO - Dependencies not met for <TaskInstance: generated_daily.submit_operator 2017-08-05 00:00:00 [up_for_retry]>, dependency 'Not In Retry Period' FAILED: Task is not ready for retry yet but will be retried automatically. Current date is 2017-08-07T23:33:28.601339 and task will be retried at 2017-08-08T00:24:46.119758.
[2017-08-07 23:33:32,954] {jobs.py:2083} INFO - Task exited with return code 0
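A note on the "Command exited with return code -15" line above: that is not an exit code from the Spark job itself. It is how Python's `subprocess` reports a child killed by signal 15 (SIGTERM), which is the signal helpers.py sends when it terminates the descendant processes. A minimal sketch reproducing that value (the `sleep` command stands in for the long-running spark-submit wrapper; this assumes a POSIX system):

```python
import signal
import subprocess
import time

# Start a long-running child process, as BashOperator does for the
# wrapped spark-submit command.
proc = subprocess.Popen(['sleep', '60'])
time.sleep(0.2)

# Terminate it the way helpers.py does once the task state has been
# externally set to up_for_retry ("Terminating descendant process ...").
proc.terminate()  # sends SIGTERM (signal 15)
ret = proc.wait()

# Popen reports death-by-signal as the negative signal number,
# which matches the "return code -15" in the log above.
assert ret == -signal.SIGTERM == -15
```

So the task did not fail on its own: it was still RUNNING on YARN and was killed from outside by the retry machinery.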

        Activity

xianping Xianping lin added a comment - edited

The job starts and logs the first 'Loading pickle':
[2017-08-07 20:55:11,854] {cli.py:391} INFO - Loading pickle id 148
== long time running... ==
[2017-08-07 23:24:38,906] {base_task_runner.py:95} INFO - Subtask: [2017-08-07 23:24:38,906] {bash_operator.py:94} INFO - 17/08/07 23:24:38 INFO yarn.Client: Application report for application_1498090868254_911911 (state: RUNNING)
Then a second 'Loading pickle' appears:
[2017-08-07 23:24:38,947] {cli.py:391} INFO - Loading pickle id 148
At this point the log is still reporting the job as running. What happened here?

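What the log suggests is happening at that second 'Loading pickle': the backfill launches a fresh `airflow run` attempt for a task instance that is still running; the new attempt correctly fails its dependency checks ('Task Instance Not Already Running', 'Task Instance State') and marks the instance UP_FOR_RETRY, and the original, still-valid job's heartbeat then sees that externally changed state and kills itself (jobs.py:2155, "Taking the poison pill"). A paraphrased sketch of that heartbeat decision, as inferred from the log messages (this is an illustration, not the actual Airflow 1.8 source):

```python
RUNNING = 'running'

def should_take_poison_pill(state_in_db):
    """Illustrative paraphrase of the check seen at jobs.py:2155.

    A live task job periodically re-reads its state from the metadata
    database; if some other process (here, the backfill's failed second
    attempt) has flipped the state away from RUNNING, the job assumes it
    was killed externally and terminates its own subprocess tree.
    """
    return state_in_db != RUNNING

# The original attempt keeps running while the DB still says 'running'...
assert not should_take_poison_pill('running')
# ...but dies once the second attempt has set 'up_for_retry'.
assert should_take_poison_pill('up_for_retry')
```

If this reading is right, the bug is that backfill should not have scheduled a second attempt for a task instance that was still in the running state in the first place.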

People

• Assignee: Unassigned
• Reporter: xianping Xianping lin
• Votes: 0
• Watchers: 1
