Zeppelin / ZEPPELIN-3077

Cron scheduler easily gets stuck when one of the cron jobs takes a long time or hangs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.3
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

    Description

      The cron scheduler easily gets stuck when one of the cron jobs takes a long time or hangs.

      I occasionally see the cron scheduler suddenly stop working. According to a thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for their jobs to complete, so no thread was left to launch a new job.

      Here are the contents of the thread dump:

      "DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
              at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
              - locked <0x00000000c0a7dbf0> (a java.lang.Object)
      
         Locked ownable synchronizers:
              - None
      
      "DefaultQuartzScheduler_Worker-9" #75 prio=5 os_prio=0 tid=0x00007fb41d3b2000 nid=0x1b520 waiting on condition [0x00007fb3daff2000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
              at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
              - locked <0x00000000c0a7a470> (a java.lang.Object)
      
         Locked ownable synchronizers:
              - None
      
      ...
      
      "DefaultQuartzScheduler_Worker-2" #68 prio=5 os_prio=0 tid=0x00007fb41d3c8800 nid=0x1b519 waiting on condition [0x00007fb3da473000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
              at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
              - locked <0x00000000c0a7a7b0> (a java.lang.Object)
      
         Locked ownable synchronizers:
              - None
      
      "DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
              at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
              - locked <0x00000000c0a7dd90> (a java.lang.Object)
      
         Locked ownable synchronizers:
              - None
      

      The thread dump above shows that all of the worker threads are stuck at https://github.com/apache/zeppelin/blob/v0.7.3/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Notebook.java#L889.
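
To illustrate the effect seen in the thread dump, here is a minimal sketch that uses a plain `java.util.concurrent` fixed-size pool as a stand-in for the Quartz worker pool. It is not Zeppelin or Quartz code; the class name `WorkerExhaustionSketch` and the method `newJobStarts` are made up for this example. Once every worker is occupied by a job that never returns, a newly submitted job cannot start.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: a fixed-size pool stands in for the scheduler's worker threads.
class WorkerExhaustionSketch {

    /** Returns true if a freshly submitted job completes within timeoutMillis. */
    static boolean newJobStarts(int workers, int stuckJobs, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch release = new CountDownLatch(1); // opened only during cleanup
        CountDownLatch occupied = new CountDownLatch(Math.min(workers, stuckJobs));
        try {
            for (int i = 0; i < stuckJobs; i++) {
                pool.submit(() -> {
                    occupied.countDown();
                    try {
                        release.await(); // stands in for a job that never finishes
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            occupied.await(); // wait until the stuck jobs hold their workers
            Future<?> fresh = pool.submit(() -> { });
            try {
                fresh.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no free worker picked the job up in time
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the stuck jobs exit so the demo terminates
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(newJobStarts(2, 2, 200));  // all workers stuck
        System.out.println(newJobStarts(2, 1, 2000)); // one worker still free
    }
}
```

With two workers and two stuck jobs the new job times out; with one worker still free it runs right away.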

      One way to reproduce this issue is to set "disable run" on a paragraph whose status is "READY". The paragraph then stays in "READY" permanently, so "note.isTerminated()" never becomes true.
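
The reproduction can be modelled with a small, self-contained sketch. The types below (`Status`, `Paragraph`, `runAll`, `isTerminated`) are simplified stand-ins invented for this example, assuming that a run skips disabled paragraphs and that a note counts as terminated only when no paragraph is left in a non-final state; they are not the actual Zeppelin classes. The polling is capped at `maxPolls` so the demo ends, whereas an uncapped loop would spin forever.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a note; not the real Zeppelin classes.
class DisabledParagraphSketch {
    enum Status { READY, PENDING, RUNNING, FINISHED }

    static final class Paragraph {
        Status status = Status.READY;
        final boolean enabled;
        Paragraph(boolean enabled) { this.enabled = enabled; }
        boolean isTerminated() { return status == Status.FINISHED; }
    }

    /** Runs every enabled paragraph; a disabled one is skipped and stays READY. */
    static void runAll(List<Paragraph> note) {
        for (Paragraph p : note) {
            if (!p.enabled) {
                continue; // "disable run": never executed
            }
            p.status = Status.FINISHED;
        }
    }

    /** True only if every paragraph reached a final state. */
    static boolean isTerminated(List<Paragraph> note) {
        for (Paragraph p : note) {
            if (!p.isTerminated()) {
                return false;
            }
        }
        return true;
    }

    /** Polls like a wait loop, but gives up after maxPolls so the demo terminates. */
    static boolean terminatesWithin(List<Paragraph> note, int maxPolls) {
        runAll(note);
        for (int i = 0; i < maxPolls; i++) {
            if (isTerminated(note)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Paragraph> allEnabled = Arrays.asList(new Paragraph(true), new Paragraph(true));
        List<Paragraph> oneDisabled = Arrays.asList(new Paragraph(true), new Paragraph(false));
        System.out.println(terminatesWithin(allEnabled, 5));  // true
        System.out.println(terminatesWithin(oneDisabled, 5)); // false
    }
}
```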

      To fix this issue, I will make the following two improvements:

      1) Remove the unnecessary `while (!note.isTerminated()) { ... }` block because the execution of all of the paragraphs is finished after `note.runAll()`.
      2) Skip the cron execution if there is a running or pending paragraph. This prevents the Zeppelin cron scheduler from getting stuck on a long-running paragraph whose execution time exceeds the cron interval.
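
The two improvements above could look roughly like the following sketch. It is not the actual patch: `CronSkipSketch`, `Status`, `hasRunningOrPending`, and `execute` are hypothetical names, and the note is reduced to a list of paragraph statuses. The job returns immediately when a previous run is still in progress, and it does not poll after triggering the run.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed behaviour; not the actual Zeppelin patch.
class CronSkipSketch {
    enum Status { READY, PENDING, RUNNING, FINISHED }

    /** True if any paragraph is still running or waiting to run. */
    static boolean hasRunningOrPending(List<Status> paragraphStatuses) {
        for (Status s : paragraphStatuses) {
            if (s == Status.RUNNING || s == Status.PENDING) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the run was triggered, false if this cron tick was skipped. */
    static boolean execute(List<Status> paragraphStatuses, Runnable runAll) {
        if (hasRunningOrPending(paragraphStatuses)) {
            return false; // improvement 2: skip this tick, the worker is freed at once
        }
        runAll.run();     // improvement 1: no wait loop after the run
        return true;
    }

    public static void main(String[] args) {
        Runnable runAll = () -> System.out.println("running all paragraphs");
        System.out.println(execute(Arrays.asList(Status.READY, Status.FINISHED), runAll));
        System.out.println(execute(Arrays.asList(Status.RUNNING, Status.READY), runAll));
    }
}
```

A paragraph that is READY but disabled no longer blocks anything in this model, because READY is not treated as running or pending.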


            People

              Assignee: Keiji Yoshida (kjmrknsn)
              Reporter: Keiji Yoshida (kjmrknsn)