Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-2161

AbstractState JNI check fails for Marathon framework

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.22.1, 0.23.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Mesos 0.21.0
      Marathon 0.7.5
      Fedora 20

    • Target Version/s:
    • Sprint:
      Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3

      Description

      I've recently upgraded to mesos 0.21.0 and now it seems that every few minutes or so I see the following error, which kills marathon.

      Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,064] INFO 10.133.128.26 - - [26/Nov/2014:00:12:41 +0000] "GET /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0" (mesosphere.chaos.http.ChaosRequestLog:15)
      Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,238] INFO 10.133.128.26 - - [26/Nov/2014:00:12:42 +0000] "GET /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0" (mesosphere.chaos.http.ChaosRequestLog:15)
      Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,961] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +0000] "GET /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537...
      Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:43,032] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +0000] "GET /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari...
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 18:12:44.146260 5897 check.hpp:79] Check failed: f.isReady()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check failure stack trace: ***
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b17c google::LogMessage::Fail()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b0d5 google::LogMessage::SendToLog()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2aab3 google::LogMessage::Flush()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2da3b google::LogMessageFatal::~LogMessageFatal()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1ea64 _checkReady<>()
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1d43b Java_org_apache_mesos_state_AbstractState__1_1names_1get
      Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f81f644ca70 (unknown)
      Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT

      Here's the command that mesos-master is being run with
      /usr/local/sbin/mesos-master --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos

      Here's the command that the slave is running with:
      /usr/local/sbin/mesos-slave --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago

      I realize this could also be filed to marathon, but it sort of looks like a c++ issue to me, which is why I came here to post this. Any help would be greatly appreciated.

        Issue Links

          Activity

          Hide
          adam-mesos Adam B added a comment -

          Looks like it could be a bug in the Mesos Java/JNI bindings, specifically a CHECK_READY that fails here: https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_state_AbstractState.cpp#L696
          Or perhaps a change in the way Marathon is using the state abstraction.

          Show
          adam-mesos Adam B added a comment - Looks like it could be a bug in the Mesos Java/JNI bindings, specifically a CHECK_READY that fails here: https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_state_AbstractState.cpp#L696 Or perhaps a change in the way Marathon is using the state abstraction.
          Hide
          cdoyle Connor Doyle added a comment -

          Matthew Sanders have you observed the symptoms of MESOS-1795 as well?

          Show
          cdoyle Connor Doyle added a comment - Matthew Sanders have you observed the symptoms of MESOS-1795 as well?
          Hide
          fullung Albert Strasheim added a comment -

          Hitting this quite regularly with Marathon 0.7.5 and Marathon 0.21.

          java[12056]: F0211 12:27:27.873663 12092 check.hpp:79] Check failed: f.isReady()
          java[12056]: *** Check failure stack trace: ***
          java[12056]: @     0x7f2a6eae3090  google::LogMessage::Fail()
          java[12056]: @     0x7f2a6eae2ff0  google::LogMessage::SendToLog()
          java[12056]: @     0x7f2a6eae29f0  google::LogMessage::Flush()
          java[12056]: @     0x7f2a6eae5775  google::LogMessageFatal::~LogMessageFatal()
          java[12056]: @     0x7f2a6ead6bda  _checkReady<>()
          java[12056]: @     0x7f2a6ead4789  Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
          java[12056]: @     0x7f2b4949ee90  (unknown)
          systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT
          
          Show
          fullung Albert Strasheim added a comment - Hitting this quite regularly with Marathon 0.7.5 and Marathon 0.21. java[12056]: F0211 12:27:27.873663 12092 check.hpp:79] Check failed: f.isReady() java[12056]: *** Check failure stack trace: *** java[12056]: @ 0x7f2a6eae3090 google::LogMessage::Fail() java[12056]: @ 0x7f2a6eae2ff0 google::LogMessage::SendToLog() java[12056]: @ 0x7f2a6eae29f0 google::LogMessage::Flush() java[12056]: @ 0x7f2a6eae5775 google::LogMessageFatal::~LogMessageFatal() java[12056]: @ 0x7f2a6ead6bda _checkReady<>() java[12056]: @ 0x7f2a6ead4789 Java_org_apache_mesos_state_AbstractState__1_1fetch_1get java[12056]: @ 0x7f2b4949ee90 (unknown) systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT
          Hide
          swartzrock Jason Swartz added a comment -

          GDB investigation of the core dump from Marathon including registers.

          Show
          swartzrock Jason Swartz added a comment - GDB investigation of the core dump from Marathon including registers.
          Hide
          jvanremoortere Joris Van Remoortere added a comment -

          We have figured out what the issue is. Niklas Quarfot Nielsen has a quick patch that almost entirely removes the problem, and we will have a follow up patch that fully fixes this issue (the delay is to ensure we maintain version upgrade compatibility).

          The pattern that is being used to manage the memory allocated in JNI is to map the java 'finalize' function to a clean-up function in C++ that calls the object destructor. This pattern relies on proper reference counting of the java object, in this case the 'Future'.

          In 0.21 and before, there was a subtle mistake:
          What we thought was happening: The 'Future' was being kept alive during the invocation of the JNI calls, because the JNI calls took a 'thiz' object
          What was actually happening: The 'thiz' jobject was actually referring to the outer class 'AbstractState' because that is where the member functions are defined.
          Why this was around for so long / hard to catch: The inter-play (race) between the garbage collector and completion time of the '__fetch_get_timeout()' routine meant that only under sufficient pressure would there be a garbage collection that triggered a 'finalize()' on a Future that was still being 'waited' on.

          The quick fix:
          Take a copy (as opposed to keeping a pointer) of the future upon entering the JNI function, so that even if the garbage collector finalizes the 'Future' during the execution of the JNI function, there is still a valid C++ reference count on the C++ 'Future' object. The strategy for this is the minimize the surface area of the bug such that only a garbage collection between function invocation and the copy would cause a problem. Prior to this fix, due to the blocking nature of the 'Future::get()' functions, there was ample surface area.

          The long term fix:
          Pass the correct jobject to the JNI function such that the object is not finalized prematurely. This requires some massaging of the API, and is more complicated due to our upgrade compatibility requirements.

          I will post a later review for the long term fix.

          Show
          jvanremoortere Joris Van Remoortere added a comment - We have figured out what the issue is. Niklas Quarfot Nielsen has a quick patch that almost entirely removes the problem, and we will have a follow up patch that fully fixes this issue (the delay is to ensure we maintain version upgrade compatibility). The pattern that is being used to manage the memory allocated in JNI is to map the java 'finalize' function to a clean-up function in C++ that calls the object destructor. This pattern relies on proper reference counting of the java object, in this case the 'Future'. In 0.21 and before, there was a subtle mistake: What we thought was happening: The 'Future' was being kept alive during the invocation of the JNI calls, because the JNI calls took a 'thiz' object What was actually happening: The 'thiz' jobject was actually referring to the outer class 'AbstractState' because that is where the member functions are defined. Why this was around for so long / hard to catch: The inter-play (race) between the garbage collector and completion time of the '__fetch_get_timeout()' routine meant that only under sufficient pressure would there be a garbage collection that triggered a 'finalize()' on a Future that was still being 'waited' on. The quick fix: Take a copy ( as opposed to keeping a pointer ) of the future upon entering the JNI function, so that even if the garbage collector finalizes the 'Future' during the execution of the JNI function, there is still a valid C++ reference count on the C++ 'Future' object. The strategy for this is the minimize the surface area of the bug such that only a garbage collection between function invocation and the copy would cause a problem. Prior to this fix, due to the blocking nature of the 'Future::get()' functions, there was ample surface area. The long term fix: Pass the correct jobject to the JNI function such that the object is not finalized prematurely. This requires some massaging of the API, and is more complicated due to our upgrade compatibility requirements. I will post a later review for the long term fix.
          Hide
          benjaminhindman Benjamin Hindman added a comment -

          commit 5d0b5fdb5d8d78b44537cb01916263a3769b5d7d
          Author: Joris Van Remoortere <joris.van.remoortere@gmail.com>
          Date: Sun Mar 29 12:20:24 2015 -0700

          Fix memory corruption in AbstractState JNI bindings. MESOS-2161.

          Review: https://reviews.apache.org/r/32152

          Show
          benjaminhindman Benjamin Hindman added a comment - commit 5d0b5fdb5d8d78b44537cb01916263a3769b5d7d Author: Joris Van Remoortere <joris.van.remoortere@gmail.com> Date: Sun Mar 29 12:20:24 2015 -0700 Fix memory corruption in AbstractState JNI bindings. MESOS-2161 . Review: https://reviews.apache.org/r/32152

            People

            • Assignee:
              jvanremoortere Joris Van Remoortere
              Reporter:
              mattomatic Matthew Sanders
              Shepherd:
              Benjamin Hindman
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development

                  Agile