Mesos / MESOS-1795

Assertion failure in state abstraction crashes JVM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.22.1, 0.23.0
    • Component/s: java api
    • Labels:
      None

      Description

      Observed the following log output prior to a crash of the Marathon scheduler:

      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] CHECK_READY(*future): is PENDING
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure stack trace: ***
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663a2d google::LogMessage::Fail()
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26657e3 google::LogMessage::SendToLog()
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663648 google::LogMessage::Flush()
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc266603e google::LogMessageFatal::~LogMessageFatal()
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26588a3 Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
      Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febcd107d98 (unknown)

      Listing 1: Crash log output.
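
      The `Java_...` frame in the trace is a JNI symbol, and under the JNI naming scheme `_1` encodes a literal underscore in the Java name. The sketch below decodes such a symbol to show which Java native method the frame most likely corresponds to. `JniNameDecoder` and `decode` are hypothetical names used only for illustration; the sketch ignores the other JNI escapes (`_2`, `_3`, `_0xxxx`) and overload suffixes, and it is not part of Mesos.

```java
public class JniNameDecoder {
    // Decodes a JNI symbol such as "Java_pkg_Class_method" into "pkg.Class.method".
    // Only the "_1" escape (a literal underscore) is handled in this sketch.
    static String decode(String symbol) {
        String prefix = "Java_";
        if (!symbol.startsWith(prefix)) {
            throw new IllegalArgumentException("not a JNI symbol: " + symbol);
        }
        StringBuilder out = new StringBuilder();
        for (int i = prefix.length(); i < symbol.length(); i++) {
            char c = symbol.charAt(i);
            if (c != '_') {
                out.append(c);
            } else if (i + 1 < symbol.length() && symbol.charAt(i + 1) == '1') {
                out.append('_'); // "_1" is an escaped underscore
                i++;             // skip the '1'
            } else {
                out.append('.'); // a plain "_" separates name components
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Symbol taken from the crash log above.
        System.out.println(
            decode("Java_org_apache_mesos_state_AbstractState__1_1fetch_1get"));
        // prints org.apache.mesos.state.AbstractState.__fetch_get
    }
}
```

      Read this way, the failing frame is the native method `__fetch_get` of `org.apache.mesos.state.AbstractState`, which matches the source file named in the `CHECK_READY` message.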

          Activity

          cdoyle Connor Doyle added a comment -

          Review for a fix is in progress here: https://reviews.apache.org/r/25614

          bmahler Benjamin Mahler added a comment -

          Do you understand what transpired?

          cdoyle Connor Doyle added a comment -

          Benjamin Mahler: I believe so; my conjectures are outlined in the review. However, if you or others can shed more light on the cause, that would be great. Unfortunately, this issue has been difficult to reproduce.

          jieyu Jie Yu added a comment -

          Could you please provide the compiler version and Linux distribution?

          sanarayanan Sathiya Narayanan added a comment -

          We also hit the same issue with Mesos 0.20 while running Marathon. You may be able to reproduce it by increasing the number of apps in Marathon.

          Linux distribution info:
          Distributor ID: Ubuntu
          Description: Ubuntu 12.04 LTS
          Release: 12.04
          Codename: precise

          When can we expect a fix for this issue?

          cdoyle Connor Doyle added a comment -

          Another person has reported a similar issue on the Marathon issues list here: https://github.com/mesosphere/marathon/issues/834.

          mcabalaji Balaji added a comment - - edited

          Hi,

          I can reproduce this issue consistently. I run Chronos with 25 jobs that run indefinitely on a single Mesos server, and Chronos stops after running for 3 hours with the error below:

          Jan 6, 2015 11:04:55 AM com.airbnb.scheduler.state.MesosStatePersistenceStore persistData
          INFO: Key for state exists already: J_GA Process data aggregate
          F0106 11:04:55.997716 10688 org_apache_mesos_state_AbstractState.cpp:330] CHECK_READY(*future): is PENDING
          ***Check failure stack trace: ***
          @ 0x7fbf42b5db60 google::LogMessage::Fail()
          Jan 6, 2015 11:04:56 AM mesosphere.chaos.http.ChaosRequestLog write
          INFO: 124.30.96.196 - audience [06/Jan/2015:11:04:55 +0000] "GET /scheduler/jobs HTTP/1.1" 200 12485 "http://54.174.92.160:8081/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
          @ 0x7fbf42b5daac google::LogMessage::SendToLog()
          @ 0x7fbf42b5d4ae google::LogMessage::Flush()
          @ 0x7fbf42b603c2 google::LogMessageFatal::~LogMessageFatal()
          @ 0x7fbf420eb3be _CheckFatal::~_CheckFatal()
          @ 0x7fbf42b4fb52 Java_org_apache_mesos_state_AbstractState__1_1store_1get
          @ 0x7fbf75391148 (unknown)
          Aborted (core dumped)

          Mesos Version : 0.21.0
          Chronos : Chronos-2.1.0_mesos-0.14.0-rc4
          Ubuntu : 12.04
          RAM : 8 GB

          cc: Connor Doyle, Benjamin Mahler, Jie Yu

          drexin Dario Rexin added a comment -

          Any update on this?

          benjaminhindman Benjamin Hindman added a comment -

          commit 5d0b5fdb5d8d78b44537cb01916263a3769b5d7d
          Author: Joris Van Remoortere <joris.van.remoortere@gmail.com>
          Date: Sun Mar 29 12:20:24 2015 -0700

          Fix memory corruption in AbstractState JNI bindings. MESOS-2161.

          Review: https://reviews.apache.org/r/32152


            People

            • Assignee: jvanremoortere Joris Van Remoortere
            • Reporter: cdoyle Connor Doyle
            • Shepherd: Niklas Quarfot Nielsen
            • Votes: 4
            • Watchers: 10
