  1. Livy
  2. LIVY-896

Livy can intermittently return a batch as SUCCEED even though Spark on Yarn actually fails

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Server
    • Labels: None

    Description

      Summary:

      • I ran into this issue on AWS EMR.
      • The frequency of the issue varies. On one EMR cluster I typically see a ~10-20% chance of hitting it; on another EMR cluster the chance is ~1%. I suspect the chance depends on how busy the underlying AWS hardware is (my EMR cluster likely shares hardware resources with other AWS tenants).
      • I believe that I have identified the root cause in the Livy source code (refer to the root cause analysis section later).

       

      How to reproduce:

      • Set up an EMR cluster with Spark, Yarn and Livy configured.
      • Use the attached livy_batch.py to trigger a Livy batch through the Livy Python client (0.8.0). See the attached livy_client.py.
      • Repeat the test. When the issue happens, Livy still reports the batch as SUCCEED even though the Spark program errors out.
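The attached scripts are not reproduced here. As a rough stand-in, the following minimal sketch shows how a client might poll a batch until it reaches a terminal state. The names `fetch_batch_state` and `wait_for_batch`, the polling parameters, and the set of terminal state strings are assumptions for illustration and are not taken from livy_client.py.

```python
import json
import time
import urllib.request

# Assumption: batch states that are treated as terminal by this sketch.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def fetch_batch_state(livy_url, batch_id):
    """Read the batch state from Livy's REST endpoint GET /batches/{id}/state."""
    with urllib.request.urlopen(f"{livy_url}/batches/{batch_id}/state") as resp:
        return json.load(resp)["state"]


def wait_for_batch(get_state, poll_seconds=1.0, max_polls=600):
    """Call get_state() repeatedly until it returns a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("batch did not reach a terminal state")
```

In a real run, `get_state` would be something like `lambda: fetch_batch_state(url, batch_id)`. Because the client trusts the first terminal state it sees, a wrong state from the server is accepted as final.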

       

      Livy log for a good case, where Livy returns the batch as DEAD (expected behavior):
      22/10/14 02:46:22 INFO BatchSessionManager: Registered new session 1
      22/10/14 02:46:42 DEBUG BatchSession: BatchSession 1 state changed from STARTING to RUNNING
      22/10/14 02:46:43 WARN BatchSession$: spark-submit exited with code 1
      22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from RUNNING to FINISHED
      22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from FINISHED to FAILED
       

      Livy log for a bad case, where Livy returns the batch as SUCCEED (bug):
      22/10/14 02:47:40 INFO BatchSessionManager: Registered new session 3
      22/10/14 02:48:00 DEBUG BatchSession: BatchSession 3 state changed from STARTING to FINISHED
      22/10/14 02:48:01 WARN BatchSession$: spark-submit exited with code 1
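The difference between the two logs can be checked mechanically. The sketch below is an illustration only: it parses the log lines quoted above and flags a session whose spark-submit exited with a non-zero code but whose last recorded state is not FAILED. The function names and regular expressions are hypothetical helpers, not part of Livy.

```python
import re

STATE_RE = re.compile(r"BatchSession (\d+) state changed from (\w+) to (\w+)")
EXIT_RE = re.compile(r"spark-submit exited with code (\d+)")


def final_state_and_exit_code(log_lines):
    """Return (last state seen, spark-submit exit code) for one session's log."""
    final_state, exit_code = None, None
    for line in log_lines:
        state_match = STATE_RE.search(line)
        if state_match:
            final_state = state_match.group(3)
        exit_match = EXIT_RE.search(line)
        if exit_match:
            exit_code = int(exit_match.group(1))
    return final_state, exit_code


def is_misreported(log_lines):
    """True when spark-submit failed but the session did not end as FAILED."""
    final_state, exit_code = final_state_and_exit_code(log_lines)
    return exit_code not in (None, 0) and final_state != "FAILED"


good_log = [
    "22/10/14 02:46:42 DEBUG BatchSession: BatchSession 1 state changed from STARTING to RUNNING",
    "22/10/14 02:46:43 WARN BatchSession$: spark-submit exited with code 1",
    "22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from RUNNING to FINISHED",
    "22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from FINISHED to FAILED",
]
bad_log = [
    "22/10/14 02:48:00 DEBUG BatchSession: BatchSession 3 state changed from STARTING to FINISHED",
    "22/10/14 02:48:01 WARN BatchSession$: spark-submit exited with code 1",
]
```

With these inputs, `is_misreported(bad_log)` is true and `is_misreported(good_log)` is false.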
       

      Root cause analysis:

      Even without hitting the timing condition, the code logic itself is still incorrect.

      If you look at the log from a "good" case, the session state is updated twice: first to FINISHED, then to FAILED. If a client query arrives between those two updates, the Livy server can still return the wrong state.

      22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from RUNNING to FINISHED 
      22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from FINISHED to FAILED
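To make the window between the two updates concrete, here is a toy model of a two-step state update. It is not Livy's actual code: the class `TwoStepSession` and the `between_writes` hook are hypothetical and exist only to make the interleaving deterministic.

```python
class TwoStepSession:
    """Toy model of a session whose final state is written in two steps."""

    def __init__(self):
        self.state = "RUNNING"

    def finish(self, app_failed, between_writes=None):
        self.state = "FINISHED"      # first write
        if between_writes is not None:
            between_writes()         # stands in for a client query landing here
        if app_failed:
            self.state = "FAILED"    # second write corrects the state


session = TwoStepSession()
observed = []
session.finish(app_failed=True,
               between_writes=lambda: observed.append(session.state))
```

In this model the query that lands between the two writes records FINISHED, although the session ends up as FAILED.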

       

      I hope we can work together to get the issue addressed ASAP, as the bug hit our production code pretty badly. I think the right code logic should be:

      1. Read the spark-submit process's state. If it is still running, do nothing.
      2. If the spark-submit process has finished, read the Yarn report and determine the actual application finish state in a single shot.
      3. Update the session state in a single step.
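The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a patch for Livy: `resolve_final_state` and `SingleStepSession` are hypothetical names, the Yarn final status is passed in as a plain string, and treating anything other than exit code 0 plus a SUCCEEDED Yarn status as FAILED is an assumption of this sketch.

```python
import threading


def resolve_final_state(exit_code, yarn_final_status):
    """Derive one final state from both inputs; None means still running."""
    if exit_code is None:
        return None                  # step 1: spark-submit still running
    if exit_code == 0 and yarn_final_status == "SUCCEEDED":
        return "FINISHED"            # step 2: decided in a single shot
    return "FAILED"


class SingleStepSession:
    """Session whose state is written at most once per update."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "RUNNING"

    def update(self, exit_code, yarn_final_status):
        new_state = resolve_final_state(exit_code, yarn_final_status)
        if new_state is not None:
            with self._lock:
                self.state = new_state   # step 3: one write, no intermediate value
```

Because the final state is computed before anything is written, a reader can only ever see RUNNING or the final state, never a transient FINISHED.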

       

      At the same time, I will see if I can create a PR with a suggested fix soon. The challenge on my side is that it's almost impossible for me to swap in a few jars built from the open-source code base on AWS EMR (they are not compatible with the EMR runtime).

       

      Thank you, Livy team!

      Regards,
      Jeff Xu, a Workday engineer

      Attachments

        1. livy_batch.py
          0.5 kB
          Jeff Xu
        2. livy_client.py
          0.3 kB
          Jeff Xu
        3. Screen Shot 2022-11-25 at 6.31.30 PM.png
          88 kB
          Jeff Xu

          People

            Assignee: Jeff Xu (jeff.xu.z@gmail.com)
            Reporter: Jeff Xu (jeff.xu.z@gmail.com)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 40m