Jetspeed 2 / JS2-666

Clustered Environment: constraint violation if clones are started at the same time

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.1.3, 2.2.0
    • Component/s: Portlet Registry
    • Labels:
      None
    • Environment:
      Websphere Application Server 6.0
      Database DB2 8.2

      Description

      Clustered Environment: constraint violation if clones are started at the same time.

      Exception thrown:

      com.ibm.websphere.ce.cm.DuplicateKeyException: [IBM][CLI Driver][DB2/6000] SQL0803N One or more values in the INSERT statement, UPDATE statement, or foreign key update caused by a DELETE statement are not valid because the primary key, unique constraint or unique index identified by "2" constrains table "PORTLET_APPLICATION" from having duplicate rows for those columns. SQLSTATE=23505

      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java(Compiled Code))
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java(Compiled Code))
      at java.lang.reflect.Constructor.newInstance(Constructor.java(Compiled Code))
      at com.ibm.websphere.rsadapter.GenericDataStoreHelper.mapExceptionHelper(GenericDataStoreHelper.java:502)
      at com.ibm.websphere.rsadapter.GenericDataStoreHelper.mapException(GenericDataStoreHelper.java:545)
      at com.ibm.ws.rsadapter.jdbc.WSJdbcUtil.mapException(WSJdbcUtil.java:902)
      at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.executeUpdate(WSJdbcPreparedStatement.java:555)
      at org.apache.ojb.broker.accesslayer.JdbcAccessImpl.executeInsert(JdbcAccessImpl.java:216)
      at org.apache.ojb.broker.core.PersistenceBrokerImpl.storeToDb(PersistenceBrokerImpl.java:1754)
      at org.apache.ojb.broker.core.PersistenceBrokerImpl.store(PersistenceBrokerImpl.java:813)
      at org.apache.ojb.broker.core.PersistenceBrokerImpl.store(PersistenceBrokerImpl.java:726)

      1. jetspeed-JS2-666-patch.diff
        21 kB
        Joachim Müller
      2. jetspeed-JS2-666-patch2.diff
        23 kB
        Ate Douma

        Activity

        Joachim Müller added a comment -

        The problem here is that the (re)registering defined in

        PortletApplicationManager.registerPortletApplication(...)

        is not encapsulated in a single transaction, and the transactions do not block other cluster nodes. To change the data of the PortletApplication it uses methods of the PersistenceBrokerPortletRegistry, which are wrapped in separate transactions for removing and creating a portlet application.

        i.e.

        PersistenceBrokerPortletRegistry.registerPortletApplication(PortletApplicationDefinition)
        PersistenceBrokerPortletRegistry.removeApplication(PortletApplicationDefinition)

        Since (re)registering removes data from and inserts data into the database without a single enclosing transaction or write lock, conflicts can occur. An example (a code sketch follows it):

        A = cluster node 1
        B = cluster node 2

        • A removes PA from DB
        • B removes PA from DB again (with no effect)
        • A inserts PA into DB
        • B inserts PA into DB (fails with a duplicate key constraint violation)
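
        In code, the race looks roughly like the following sketch (hypothetical types; only the two method names come from PersistenceBrokerPortletRegistry):

            // Each registry call commits its own transaction, leaving a
            // window between the DELETE and the INSERT.
            final class RedeployRaceSketch {
                interface Registry {
                    void removeApplication(Object pa);          // transaction 1: DELETE
                    void registerPortletApplication(Object pa); // transaction 2: INSERT
                }

                static void redeploy(Registry registry, Object pa) {
                    registry.removeApplication(pa);
                    // window: another node may commit its INSERT here; the next
                    // INSERT then fails with a duplicate key error (SQL0803N)
                    registry.registerPortletApplication(pa);
                }
            }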

        What would the options be:

        1.) Make sure only one cluster node can (re)deploy the portlet application at a time.

        A first approach could be (see the sketch after the list):

        • delete and insert should only be executed if not already executed by another cluster node
        • to synchronize, add a kind of "monitor" to the database (e.g. a new table with a monitoring "flag" and optimistic locking)
        • every cluster node checks the monitor
        • if the monitor is not set, the cluster node sets it and executes the delete/insert
        • if the monitor is set, the cluster node waits until the monitor is "free" and only reloads the registry (with the Portlet Application already written by the other cluster node)
        • if both cluster nodes want to update the monitor, optimistic locking leads to an exception on one side; that side should then also wait and reload
        • make sure the cluster node retries to (re)deploy the portlet application on exception (see 2.))
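
        Such a "monitor" could be sketched as follows. This is only an illustration, assuming a hypothetical DEPLOYMENT_LOCK table with PA_NAME and VERSION columns; none of these names exist in Jetspeed:

            import java.sql.Connection;
            import java.sql.PreparedStatement;
            import java.sql.ResultSet;
            import java.sql.SQLException;

            // Optimistic "monitor": a node may only deploy if it can bump the
            // version it last read. If another node updated the row first, the
            // UPDATE affects zero rows and this node should wait and reload.
            final class DeploymentMonitorSketch {
                static boolean tryAcquire(Connection con, String paName) throws SQLException {
                    long version;
                    try (PreparedStatement read = con.prepareStatement(
                            "SELECT VERSION FROM DEPLOYMENT_LOCK WHERE PA_NAME = ?")) {
                        read.setString(1, paName);
                        try (ResultSet rs = read.executeQuery()) {
                            if (!rs.next()) {
                                return false; // lock row is expected to be seeded at install time
                            }
                            version = rs.getLong(1);
                        }
                    }
                    try (PreparedStatement update = con.prepareStatement(
                            "UPDATE DEPLOYMENT_LOCK SET VERSION = ? WHERE PA_NAME = ? AND VERSION = ?")) {
                        update.setLong(1, version + 1);
                        update.setString(2, paName);
                        update.setLong(3, version);
                        return update.executeUpdate() == 1; // 0 rows: another node won the race
                    }
                }
            }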

        2.) Catch the exception, roll back, and keep trying to (re)deploy the portlet.xml.

        I am not sure whether this is a good solution, because multiple transactions on multiple cluster nodes could produce invalid data in the database tables, or deadlocks. (I am not a clustered-environment database expert.)
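
        As a sketch (reusing the hypothetical Registry interface from the first sketch above), option 2.) boils down to a bounded retry loop:

            // Catch the duplicate key failure, back off, and retry the
            // (re)deployment; all names here are illustrative.
            final class RetryingRedeploySketch {
                static void redeploy(RedeployRaceSketch.Registry registry, Object pa, int maxAttempts)
                        throws InterruptedException {
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            registry.removeApplication(pa);
                            registry.registerPortletApplication(pa);
                            return; // success
                        } catch (RuntimeException duplicateKey) {
                            // another node likely won the race; wait and try again
                            Thread.sleep(1000L * attempt);
                        }
                    }
                }
            }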

        3.) Change the (re)deploy process:

        • avoid deletion of the portlet application
        • step through the object tree and insert/update only where necessary
        • combine this with optimistic locking (requires a data model change; see the sketch after this list)
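
        A sketch of what 3.) could look like with OJB's optimistic locking (this assumes a new version column marked locking="true" in the repository descriptor, which is the data model change mentioned above; the flow itself is illustrative, not existing Jetspeed code):

            import org.apache.ojb.broker.OptimisticLockException;
            import org.apache.ojb.broker.PersistenceBroker;

            // Keep the existing PORTLET_APPLICATION row and store changes in
            // place: a concurrent writer then gets an OptimisticLockException
            // instead of causing a duplicate key violation.
            final class UpdateInPlaceSketch {
                static boolean store(PersistenceBroker broker, Object existingPa) {
                    try {
                        broker.beginTransaction();
                        broker.store(existingPa);  // UPDATE ... WHERE ... AND VERSION = ?
                        broker.commitTransaction();
                        return true;
                    } catch (OptimisticLockException concurrentUpdate) {
                        broker.abortTransaction(); // another node changed the row: reload and retry
                        return false;
                    }
                }
            }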

        4.) Another slick solution that makes everything much easier (maybe at the OJB level?)

        I would like to synchronize with the core developers before starting to implement a solution. What do you think?

        The quickest solution for now, with the least impact on the data model and code base, would be 2.), but I am not sure it is a really robust solution. Please comment.

        To generally avoid problems in clustered environments, we may have to change some aspects of the database access via OJB, as described in:

        http://db.apache.org/ojb/docu/howtos/howto-work-with-clustering.html
        http://db.apache.org/ojb/docu/guides/lockmanager.html#LockManagerRemoteImpl

        Joachim Müller added a comment -

        I have attached a patch that addresses the problem.

        It solves the problem as follows:

        1.) It introduces an (optional) configurable maxRetriedStart parameter (defaults to 10). This parameter defines how often the PA (portlet application) manager will try to restart a PA on error.

        2.) A PA registration error on startup no longer prevents the PA from being registered. The descriptor change monitor is always started for the PA, even in case of a registration error.

        3.) The descriptor change monitor tries to start the PA if

        a.) the PA descriptors have changed OR
        b.) the previous start of the PA was unsuccessful, as long as the number of unsuccessful starts does not exceed maxRetriedStart (defaults to 10)

        This means that in a cluster (we presume identical portlet descriptors here) the cluster nodes can "delay" the PA registration if a node encounters registration problems (like the described constraint violation). If the problem is not recoverable (e.g. portlet.xml is corrupted), re-registration is deactivated after the configured number of retries (but registration restarts on PA descriptor changes).

        The registration of the PA and the synchronization of the PA descriptors with the database are still driven by the picked-up changes of the (file-based) PA descriptors. The cluster nodes will not pick up changes to the PA introduced by another cluster node as long as they are not restarted.
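
        The retry rule from 3.) above can be summarized in a small sketch (all names are illustrative except maxRetriedStart, which the patch introduces):

            // A PA is (re)started when its descriptors changed, or when the
            // previous start failed and the retry budget is not yet exhausted.
            final class StartCheckSketch {
                private final int maxRetriedStart = 10; // patch default

                boolean shouldStartPA(boolean descriptorsChanged, boolean lastStartFailed, int failedStarts) {
                    if (descriptorsChanged) {
                        return true;  // a.) descriptors changed: always try to (re)start
                    }
                    // b.) previous start failed and retries not exhausted
                    return lastStartFailed && failedStarts < maxRetriedStart;
                }
            }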

        Joachim Müller added a comment -

        The patch

        Ate Douma added a comment -

        As I had to apply the provided patch by hand to be able to review it, I made a new one.

        Note: I haven't reviewed the patch yet, but plan to do so later this weekend / early next week.

        Ate Douma added a comment - edited

        Reviewing the new patch I made earlier, it turned out I had made a few errors when applying Joachim's patch by hand.
        After fixing those, the patch tested out very well and definitely provides some improvements, as well as more robust handling for clustered environments.

        So, for the record, I'm attaching a fixed version of the patch I created; I will then commit it to both the 2.2 trunk and the 2.1.3 branch, against which I actually tested it.

        Ate Douma added a comment -

        Patch applied to both the 2.1.3 branch and the 2.2 trunk (I needed to merge the changes for JS2-799 first to be able to).


          People

          • Assignee:
            Ate Douma
          • Reporter:
            Frank Stalherm
          • Votes: 0
          • Watchers: 0
