ActiveMQ / AMQ-3654

JDBC Master/Slave: Slave cannot acquire lock when the master loses its database connection.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.7.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Unix/Redhat 5.6
      ActiveMQ 5.5.0
      Oracle 10G

      Description

      Our configuration is JDBC Master/Slave with one master and one slave. When the master is started, it acquires the database lock.
      Then, when the slave is started, it waits to acquire the database lock. When the master loses its network connection to the database, the lock in the database is not released and the slave cannot acquire it. In this situation, the master is unable to respond to clients (due to the network failure)
      and the slave does not start because it cannot acquire the database lock.

      When the master is killed, the slave still cannot acquire the database lock. After the network connection is restored and the master is restarted, it cannot
      acquire the database lock either (because the old lock is still present), so we now have two slaves and no master.

      Please refer to AMQ-1958, which describes the same problem.

          Activity

          SuoNayi added a comment -

          I guess that when the master loses its network connection to the database, the underlying JDBC connection cannot detect in time that the network connection is broken.
          You can try setting the lockKeepAlivePeriod property of jdbcPersistenceAdapter to a value smaller than the default (30 s).
          If it works, please let me know. Thanks.

          Richard Martin added a comment -

          Thanks for the suggestion. I tried setting the lockKeepAlivePeriod property to 3 s with this configuration:

              <jdbcPersistenceAdapter lockKeepAlivePeriod="3000" dataDirectory="${activemq.base}/data" dataSource="#oracle-ds" />

          But it didn't work. When the network connection on the master is lost, the lock is still present in the database. Is there any other idea that could solve the network failure test?

          Do you think I can use JDBC mode configured with useDatabaseLock="false"? It seems to be a master/master solution. We need an HA solution with no message loss. With the shared database, the messages are stored in the same place, so if master1 fails, the message is still stored in the database, our client switches to master2, and it can consume the message. Is that correct, or must I use JDBC master/slave with useDatabaseLock="true" to have an HA solution?

          SuoNayi added a comment -

          It seems this is for Oracle to handle, because it does not detect the broken connection and release the row lock in time.
          You can consult your DBA for more advice.

          metatech added a comment -

          If the database connection is abruptly lost (no TCP FIN or RST), for instance when the network cable is unplugged, the DB client application can no longer release the lock.
          The DB server itself needs to detect the dead connection.
          On Oracle the parameter "SQLNET.EXPIRE_TIME" can be used.

          kimm king added a comment -

          metatech's suggestion is a good one.

          Relying on an exclusive lock here is a fragile, somewhat magical approach.
          Under certain conditions, this error will occur.
          And if a storage backend does not support this kind of lock, it simply fails.

          ranpeng added a comment -

          I use a MySQL database to store messages, and I hit the same problem. When the master loses its network connection to the database, the lock in the database is not released and the slave cannot acquire the database lock. After the network connection recovers, the master still cannot acquire the database lock, so we now have two slaves and no master. After a period of time, the master process exits automatically, leaving only a slave. Is there any other idea that could solve the network failure case?

          Gary Tully added a comment -

          Added a lease-based database locker. Use it as follows from the XML config with the 5.7-SNAPSHOT:

                  <ioExceptionHandler>
                      <jDBCIOExceptionHandler/>
                  </ioExceptionHandler>

                  <persistenceAdapter>
                      <jdbcPersistenceAdapter lockKeepAlivePeriod="1000" lockAcquireSleepInterval="2000">
                          <databaseLocker>
                              <lease-database-locker/>
                          </databaseLocker>
                      </jdbcPersistenceAdapter>
                  </persistenceAdapter>
          

          The optional IOExceptionHandler will pause/resume the transport connectors on any IO exception related to access to the DB.
          The lease-based lock is acquired by blocking at start and retained by a keep-alive every lockKeepAlivePeriod. To retain it, the lease is extended by the lockAcquireSleepInterval, so in theory the master is always

          lockAcquireSleepInterval - lockKeepAlivePeriod

          ahead of the slave w.r.t. the lease (1000 ms with the values above).
          The lease is dropped on normal shutdown.
          If the broker system clock is not in sync with the DB, a maxAllowableDiffFromDBTime > 0 will adjust the lease duration if the skew exceeds the absolute maxAllowableDiffFromDBTime value, allowing the DB to dictate the UTC basis for the lease.
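
          As a rough illustration of the clock-skew setting described above, the configuration might look like the sketch below. It assumes that maxAllowableDiffFromDBTime is set as an attribute on lease-database-locker (the comment does not say where it goes), and the value of 1000 ms is purely illustrative. The data source is omitted, as in the snippet above.

          ```xml
          <!-- Sketch only, not a verified configuration. -->
          <persistenceAdapter>
              <!-- Keep-alive every 1000 ms; each keep-alive extends the lease by 2000 ms -->
              <jdbcPersistenceAdapter lockKeepAlivePeriod="1000" lockAcquireSleepInterval="2000">
                  <databaseLocker>
                      <!-- Assumption: attribute placement on this element.
                           Illustrative value: adjust the lease when broker/DB skew exceeds 1000 ms. -->
                      <lease-database-locker maxAllowableDiffFromDBTime="1000"/>
                  </databaseLocker>
              </jdbcPersistenceAdapter>
          </persistenceAdapter>
          ```

          With these numbers the master would renew its lease roughly 1000 ms before a waiting slave could see it expire, so the skew tolerance probably should not be larger than that margin.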

          Gary Tully added a comment -

          I would appreciate it if you could validate the efficacy of this lease-based approach in your environment using the latest 5.7-SNAPSHOT.

          Gaurav Sharma added a comment -

          Thanks Gary. What is the suggested patch back-porting process in case I want to apply this patch to my v5.6 broker running the jdbc master-slave configuration against oracle? Should I just build the core jar from source after applying the changes to v5.6 code in my local svn repo?

          Gary Tully added a comment -

          Maybe first try a 5.7-SNAPSHOT to validate that it works in your use case, but sure, an updated activemq-core is all that you will need.

          Gaurav Sharma added a comment -

          Thanks Gary, will first test with the 5.7-SNAPSHOT.

          Sreeni Iyer added a comment -

          +1 for the lock lease approach with the deck loaded in favor of the current lessee. That's how I have seen it work with Terracotta and many other implementations. It seems only the implementation needs to be tested out.

          Gaurav Sharma added a comment -

          Tested 5.7-SNAPSHOT with MySQL and the patch works just fine. The BROKER_NAME also seems to be getting populated in the lock table.

          SuoNayi added a comment - edited

          With the row-lock-based database locker, I use the same configuration file for all brokers.
          This makes deployment really simple.
          LeaseDatabaseLocker needs a unique lease id, which by default is the broker name,
          so does this mean I have to specify a different broker name for each broker in the cluster?
          My last question: should lockAcquireSleepInterval always be greater than lockKeepAlivePeriod?

          Gary Tully added a comment -

          @SuoNayi
          Yes, either the brokerName or the lease-database-locker.leaseHolderId needs to be unique to each broker in a master/slave pair.
          And yes, lockAcquireSleepInterval > lockKeepAlivePeriod is necessary.
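
          For illustration, the two brokers of a pair could differ only in the locker element, along the lines of the fragments below. The holder ids "broker-a" and "broker-b" are hypothetical names, and the timing values are simply reused from the earlier example; this is a sketch of the idea, not a tested configuration.

          ```xml
          <!-- Fragment for the first broker (hypothetical holder id) -->
          <jdbcPersistenceAdapter lockKeepAlivePeriod="1000" lockAcquireSleepInterval="2000">
              <databaseLocker>
                  <lease-database-locker leaseHolderId="broker-a"/>
              </databaseLocker>
          </jdbcPersistenceAdapter>

          <!-- Fragment for the second broker: identical except for the holder id -->
          <jdbcPersistenceAdapter lockKeepAlivePeriod="1000" lockAcquireSleepInterval="2000">
              <databaseLocker>
                  <lease-database-locker leaseHolderId="broker-b"/>
              </databaseLocker>
          </jdbcPersistenceAdapter>
          ```

          Both fragments keep lockAcquireSleepInterval (2000) greater than lockKeepAlivePeriod (1000), as required.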


            People

            • Assignee: Gary Tully
            • Reporter: Richard Martin
            • Votes: 2
            • Watchers: 10
