ActiveMQ
AMQ-1350

JDBC master/slave does not work properly with datasources that can reconnect to the database

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.2.0
    • Fix Version/s: NEEDS_REVIEW
    • Component/s: Message Store
    • Labels:
      None
    • Environment:

      Linux x86_64, Sun jdk 1.6, Postgresql 8.2.4, c3p0 or other pooling datasources

    • Patch Info:
      Patch Available

      Description

      This problem involves the JDBC master/slave configuration when the database server is restarted, or when the brokers temporarily lose their JDBC connections for whatever reason, and when a datasource is in use that can re-establish stale connections before handing them to the broker.

      The problem lies with the JDBC locking strategy used to determine which broker is master and which are slaves. Let's say there are two brokers, a master and a slave, and they've successfully initialized. If you restart the database server, the slave will throw an exception because it's just caught an exception while blocked attempting to get the lock. The slave will then retry the process of getting a lock over and over again. Now, since the database was bounced, the master will have lost its lock in the activemq_lock table. However, with the current 4.x-5.x code, it will never "know" that it has lost the lock. There is no mechanism to check the lock state. So it will continue to think that it is the master and will leave all of its network connectors active.
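
      For illustration only, a minimal JDBC sketch of the election pattern described above. It assumes the default ACTIVEMQ_LOCK table name and a plain java.sql connection; the exact SQL ActiveMQ issues varies by adapter version and dialect.

      // Hypothetical sketch of the lock-based election described above: the broker
      // that manages to lock the row becomes master, the others block here.
      // Note that nothing ever re-checks the lock afterwards, which is exactly
      // the gap this issue describes.
      import java.sql.Connection;
      import java.sql.PreparedStatement;

      public class ExclusiveLockSketch {
          public static void waitToBecomeMaster(Connection c) throws Exception {
              c.setAutoCommit(false);
              // Blocks until the row lock is granted; the connection (and its open
              // transaction) must then be kept alive to hold the lock.
              PreparedStatement ps =
                  c.prepareStatement("SELECT * FROM ACTIVEMQ_LOCK FOR UPDATE");
              ps.execute();
              // From here on the broker assumes it is master forever; if the database
              // is bounced, the lock silently disappears but the broker never notices.
          }
      }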

      When the slave tries to acquire the lock now, if the datasource has restored connections to the now-restarted database server, it will succeed. The slave will come up as master, and there will be two masters active concurrently. Both masters should at this point be fully-functional, as both will have datasources that can talk to the database server once again.

      I have tested this with c3p0 and verified that I get two masters after bouncing the database server. If, at that point, I kill the original slave broker, the original master still appears to be functioning normally. If, instead, I kill the original master broker, messages are still delivered via the original slave (now co-master). It does not seem to matter which broker the clients connect to - both work.

      There is no workaround that I can think of that would function correctly across multiple database bounces. If a slave's datasource cannot re-establish database connections, then after the first database server restart it will never be able to connect to the db server in order to attempt to acquire the lock. This, combined with the fact that the JDBC master/slave topology has no favored brokers (any broker can be master or slave depending on start-up order and the failures that have occurred over time), means that a datasource that can do reconnects is required on all brokers. Therefore it would seem that in the JDBC master/slave topology a database restart or temporary loss of database connectivity will always result in multiple masters.
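
      For reference, a hedged sketch of a pooling datasource configured to detect and replace stale connections, roughly matching the c3p0 setup used in the tests above; the driver class, URL and credentials are placeholders.

      // Sketch only: a c3p0 pool that validates connections so stale ones are
      // replaced after a database restart. Values below are placeholders.
      import com.mchange.v2.c3p0.ComboPooledDataSource;

      public class ReconnectingDataSourceSketch {
          public static ComboPooledDataSource create() throws Exception {
              ComboPooledDataSource ds = new ComboPooledDataSource();
              ds.setDriverClass("org.postgresql.Driver");
              ds.setJdbcUrl("jdbc:postgresql://localhost:5432/activemq");
              ds.setUser("activemq");
              ds.setPassword("activemq");
              ds.setTestConnectionOnCheckout(true); // validate before handing to the broker
              ds.setIdleConnectionTestPeriod(30);   // re-test idle connections every 30 s
              ds.setAcquireRetryAttempts(0);        // 0 = keep retrying until the db is back
              return ds;
          }
      }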

        Activity

        Rob Davies made changes -
        Fix Version/s NEEDS_REVIEWED [ 12186 ]
        Fix Version/s 5.4.0 [ 12110 ]
        Bruce Snyder made changes -
        Fix Version/s 5.4.0 [ 12110 ]
        Fix Version/s NEED_REVIEWED [ 12186 ]
        Bruce Snyder made changes -
        Link This issue is related to AMQ-1352 [ AMQ-1352 ]
        Bruce Snyder made changes -
        Fix Version/s AGING_TO_DIE [ 12187 ]
        Patch Info [Patch Available]
        Fix Version/s NEED_REVIEWED [ 12186 ]
        Bruce Snyder made changes -
        Fix Version/s 5.4.0 [ 12110 ]
        Fix Version/s AGING_TO_DIE [ 12187 ]
        Gary Tully made changes -
        Fix Version/s 5.4.0 [ 12110 ]
        Fix Version/s 5.3.0 [ 11914 ]
        Mario Siegenthaler added a comment -

        Yes, the patch for AMQ-1885 helps a lot; I've yet to encounter a situation where it fails to work. My patch was mostly about enabling the lockKeepAlivePeriod.

        The only thing that does not work for me is a correct broker shutdown. The broker more or less quits itself but then reports "failed to stop broker" and leaves the VM running. This is a bit annoying, because we could simply auto-restart the VM if it terminated properly; instead we have to go through ugly log parsing and process killing in the shell script. My patch fixed the shutdown for the then-trunk; it was rather trivial to do (just catch & log/ignore some exceptions).

        I propose to close this issue and open a new one regarding the failing broker shutdown (5.1) on db failure [this also happens when, e.g., the transaction log is full].

        Gary Tully added a comment -

        The fix for https://issues.apache.org/activemq/browse/AMQ-1885 should help a bit here. The slave will retry and the master will fail in the event of a db outage. On db restart, the slave, still being alive, should become the master.
        There is a test case with that change that may provide a template for a test case for this issue.
        In your scenario it seems odd that there are two masters; this points to a problem with the lock statements for your test db.
        Note the org.apache.activemq.store.jdbc.DefaultDatabaseLocker, which attempts to maintain an acquired lock by updating a table entry; it has a configurable lockKeepAlivePeriod.
        Would it be possible to revisit your patch in the light of the current trunk?
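
        A programmatic sketch of that configuration, assuming the lockKeepAlivePeriod setter backing the XML attribute shown in the patch example later in this thread:

        // Sketch: wire a reconnecting DataSource into the JDBC persistence adapter
        // and enable the keep-alive so the master periodically re-asserts its lock.
        import javax.sql.DataSource;
        import org.apache.activemq.broker.BrokerService;
        import org.apache.activemq.store.jdbc.JDBCPersistenceAdapter;

        public class JdbcMasterSlaveBrokerSketch {
            public static BrokerService start(DataSource dataSource) throws Exception {
                JDBCPersistenceAdapter jdbc = new JDBCPersistenceAdapter();
                jdbc.setDataSource(dataSource);
                jdbc.setLockKeepAlivePeriod(1000);  // re-assert/check the lock every second

                BrokerService broker = new BrokerService();
                broker.setPersistenceAdapter(jdbc);
                broker.start();                     // a slave blocks here waiting for the lock
                return broker;
            }
        }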

        Gary Tully made changes -
        Fix Version/s 5.3.0 [ 11914 ]
        Fix Version/s 5.2.0 [ 11841 ]
        Manish Bellani added a comment -

        How about using DistributedLock from JGroups, or something similar, to make master/slave work?
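
        A rough sketch of what a JGroups-based election could look like, using the LockService API from newer JGroups releases rather than the exact DistributedLock class named above, and assuming a protocol stack that includes a locking protocol such as CENTRAL_LOCK:

        // Sketch only: elect a master via a JGroups cluster-wide lock.
        import java.util.concurrent.locks.Lock;
        import org.jgroups.JChannel;
        import org.jgroups.blocks.locking.LockService;

        public class JGroupsMasterElectionSketch {
            public static void main(String[] args) throws Exception {
                // The stack referenced here must contain a locking protocol (e.g. CENTRAL_LOCK).
                JChannel channel = new JChannel("locking-stack.xml");
                channel.connect("activemq-master-election");
                Lock masterLock = new LockService(channel).getLock("activemq-master");
                masterLock.lock();   // blocks until this node becomes master
                // ... start the broker here; unlock and close the channel on shutdown
            }
        }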

        Rob Davies made changes -
        Fix Version/s 5.2.0 [ 11841 ]
        Mario Siegenthaler added a comment -

        I did some further research on this topic. Here's what I'm going for:
        a) Lock something (e.g. the lock table) on startup
        + on success: go to b)
        + else: keep trying to lock until you succeed (repeat a)
        b) Start the broker and a keep-alive thread (executed every x seconds -> c)
        c) Check that we still hold the lock (and that the db is still there)
        + if we do: wait till the next keep-alive, then execute c) again
        + else: d)
        d) Shut down the broker because there's another master running

        Now the tricky part of this idea is step c), because there's no standard way to express "go see if you can lock that row/table/whatever and return immediately if it's already locked" (something like a tryLockNoWait). There isn't even a standard way to express a lock-wait timeout.
        While it's possible to simulate a lock timeout (e.g. terminate the query after 5 s and consider the table locked by another party), this is an unclean and, IMO, risky approach.
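
        A rough sketch of steps b) to d) as a keep-alive thread; the actual "check without blocking" is left abstract here because, as noted above, there is no portable way to express it:

        // Sketch of the keep-alive idea (steps b-d). stillHoldsLock() stands in for
        // the hypothetical tryLockNoWait-style check discussed above.
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class LockKeepAliveSketch {
            private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

            public void start(long periodMillis) {                 // step b)
                timer.scheduleAtFixedRate(() -> {
                    if (!stillHoldsLock()) {                       // step c)
                        shutDownBroker();                          // step d)
                    }
                }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
            }

            private boolean stillHoldsLock() {
                // Needs a database-specific non-blocking check, see below.
                return true;
            }

            private void shutDownBroker() { /* stop transports, stop the broker */ }
        }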

        I can offer a solution for three database systems:

        • MySQL: select get_lock("my_activemq_lock", 0); does exactly what I want. It doesn't use the lock table.
        • MS SQL Server: select * from activemq_lock with (readpast) where id=1 would skip the row without waiting if it's locked, so we can look at the result count. The same should also be possible with an update statement.
        • Oracle: is supposed to have the same feature as SQL Server, although with slightly different syntax.

        My research for a DB2 solution was unsuccessful; the others I haven't tried yet.
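
        For the MySQL variant, a hedged JDBC sketch of the non-blocking check; the lock name is arbitrary, and the check must run on the same connection that originally acquired the lock, since GET_LOCK locks are connection-scoped:

        // Sketch: GET_LOCK with a zero timeout returns immediately; 1 means this
        // session holds (or just acquired) the named lock, 0 means another session
        // holds it.
        import java.sql.Connection;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class MySqlTryLockSketch {
            public static boolean tryLockNoWait(Connection c) throws Exception {
                try (Statement s = c.createStatement();
                     ResultSet rs = s.executeQuery("SELECT GET_LOCK('my_activemq_lock', 0)")) {
                    return rs.next() && rs.getInt(1) == 1;
                }
            }
        }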

        Any feedback on this solution?

        Mario Siegenthaler added a comment -

        Note that the patch will not try to reacquire the lock; it'll just check whether anybody else holds the lock and shut down in that case. We could also try to check whether we still hold the lock and update it if necessary. However, I fear that doing a SELECT FOR UPDATE every x seconds will kill/slow down the database because it'll result in thousands of locks. Or does the database realize that we already hold the lock, making the statement a no-op lock-wise? Also, is there a portable way to check for an existing lock without being blocked in that case?

        Mario Siegenthaler made changes -
        Attachment activemq-master-slave.patch [ 15631 ]
        Mario Siegenthaler added a comment -

        Patch for this issue. You can now specify a keep-alive/check period for the lock on the database. If the lock is lost then the broker is shut down.
        Example configuration:
        <persistenceAdapter>
          <jdbcPersistenceAdapter dataSource="#mysql-ds" lockKeepAlivePeriod="1000"/>
        </persistenceAdapter>

        Details of the changes:

        • Expose the already existing lock (added getter and setter)
        • Fixed the startup of the PersistenceAdapter by the BrokerService. The configureService() method wasn't executed when configured via XML.
        • Fixed some smaller things within JDBCPersistence (mostly error-handling stuff in the shutdown case)
        • Better handling of absent database locking (configuration flag). Introduced a NoLock-Locker to avoid having to check for the flag all over the place.
        • Moved the INSERT of the lock row from the database setup to the lock acquire. Reason: this statement, executed on the slave (db already locked), will block, resulting in the missing "Attempting to acquire the exclusive lock to become the Master broker" message. (This fix is not directly related to this issue.)
        Mario Siegenthaler added a comment -

        We've also experienced this behavior on a 4.1.1 master/slave configuration using SQL Server. The master somehow lost the lock during a database maintenance operation (we suspect a DB admin killed the lock in order to be able to back up the database) and we ended up with two masters.

        Eric Anderson created issue -

          People

          • Assignee: Unassigned
          • Reporter: Eric Anderson
          • Votes: 2
          • Watchers: 6
