Geronimo
GERONIMO-4222

Database pool unusable after database unavailable for a while

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.3, 2.1.4
    • Fix Version/s: None
    • Component/s: None
    • Security Level: public (Regular issues)
    • Labels: None
    • Environment:

      Red Hat Enterprise Linux Server v5.2

      WAS-CE v2.0.0.1, based on Geronimo v2.0.2

    Description

      I have frequent trouble with my database pool to an AS/400. The database is taken down every night for backup, and at least once a week the connection pool is unusable after the database comes back up. Restarting the connection pool makes everything work again.

      We are new to Geronimo/WAS-CE – just this one app on one server – so I don't have anything to compare to. However, we had this same issue with a couple of 1.x/1.1.x versions before we upgraded to v2. Also, there are several WebSphere (full WAS, not WAS-CE) apps that do not have this trouble.

      Configuration Info
      Driver: JTOpen v6.1 (com.ibm.as400.access.AS400JDBCDriver)
      Pool Min Size: 0
      Pool Max Size: 100
      Blocking Timeout: 5000
      Idle Timeout: 15
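
      For reference, these values map onto the <connectionmanager> section of a Geronimo connector deployment plan. A minimal sketch, assuming the standard connector plan schema (the surrounding connectiondefinition elements are omitted):

      <connectionmanager>
          <local-transaction/>
          <single-pool>
              <max-size>100</max-size>
              <min-size>0</min-size>
              <blocking-timeout-milliseconds>5000</blocking-timeout-milliseconds>
              <idle-timeout-minutes>15</idle-timeout-minutes>
              <match-one/>
          </single-pool>
      </connectionmanager>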

      Attachments

      1. before and after wasce restart.txt
        26 kB
        David Frahm
      2. connector.patch
        18 kB
        Jack Cai
      3. PGtrial.patch
        4 kB
        Jack Cai
      4. stacktrace.txt
        21 kB
        Thad West
      5. tranql-connector-derby-embed-local-1.5-SNAPSHOT.rar
        91 kB
        Jack Cai
      6. tranql-connector-derby-embed-xa-1.5-SNAPSHOT.rar
        91 kB
        Jack Cai
      7. tranql-connector-mysql-local-1.3-SNAPSHOT.rar
        87 kB
        Jack Cai
      8. tranql-connector-mysql-xa-1.3-SNAPSHOT.rar
        87 kB
        Jack Cai
      9. tranql-connector-postgresql-common-1.1.jar
        10 kB
        Jack Cai
      10. tranql-connector-postgresql-local-1.2-SNAPSHOT.rar
        86 kB
        Jack Cai
      11. tranql-connector-postgresql-xa-1.2-SNAPSHOT.rar
        86 kB
        Jack Cai
      12. vendors.patch
        9 kB
        Jack Cai

        Activity

        Kevan Miller made changes -
        Status: Reopened → Closed
        Fix Version/s: Wish List
        Resolution: Fixed
        Kevan Miller added a comment -

        There have been multiple fixes to our connector implementation. As noted by David and Radim, we appear to have gotten things right...

        Radim Kolar added a comment -

        This ticket should be closed as fixed. I haven't seen this error either. It works as expected now.

        David Frahm added a comment -

        FYI – Since my original post opening this issue, we have upgraded to WASCE 2.1.1.3 and also RHEL 5.3. It's been many months, and we have not seen the specific error I described even once.

        Jack Cai added a comment -

        Thanks Forrest for trying out the patch!

        I guess the problem is that you still have a tranql-connector-1.4.jar in your repo that gets loaded by the server (each vendor RAR contains a copy of this jar). Once all the 1.4 jars are replaced with the updated jars, the problem should go away.

        Forrest Xia added a comment -

        Results of some tries:
        1. Derby ones work
        2. PostgreSQL local one works
        3. MySQL local one works

        However, there is a strange classloader problem when I try an application-wide datasource. If the classloader behavior is left at the default (parent-first), then the PostgreSQL and MySQL local datasources cannot be created, and an exception like the one below is thrown:

        java.lang.NoSuchMethodError: org/tranql/connector/jdbc/AbstractLocalDataSourceMCF.<init>(Ljavax/sql/ConnectionPoolDataSource;Z)V
        at org.tranql.connector.postgresql.PGSimpleLocalMCF.<init>(PGSimpleLocalMCF.java:34)

        So I added <inverse-classloading/> to the deployment plan, and then they work fine!
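
        For anyone hitting the same NoSuchMethodError, the switch goes in the plan's environment section. A minimal sketch, assuming the Geronimo 2.x deployment namespace (moduleId and dependencies omitted):

        <environment xmlns="http://geronimo.apache.org/xml/ns/deployment-1.2">
            <!-- load classes from the module before consulting the parent classloader -->
            <inverse-classloading/>
        </environment>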

        David Jencks added a comment -

        Jack, I was working on the sqlstate exception sorter and just implemented ConnectionPoolDataSource wrapping before I saw your work. I'll compare our approaches and look at the vendor wrappers tomorrow.

        Radim, I haven't looked into this in a long time, but I think the standard behavior for uncommitted work when putting a connection back in the pool is to commit it. However, the problem we are dealing with here is that on a connection error, after removing the connection from the pool so it won't ever be used again, we are not trying to roll back work on it before destroying it. If we could tell whether the connection was actually still usable, we wouldn't close it at all when the rollback could succeed... so two things to do are to improve our decision about when to close the connection (using a ConnectionPoolDataSource and a better ExceptionSorter) and to try to roll back before destroying, in case we were wrong about needing to destroy the connection.
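
        A minimal sketch of that rollback-before-destroy idea (illustrative only; "connection" stands for the physical JDBC connection being discarded):

        // Hypothetical cleanup path for a connection suspected to be dead.
        private void destroyConnection(java.sql.Connection connection) {
            try {
                if (!connection.getAutoCommit()) {
                    connection.rollback(); // discard uncommitted work, in case the connection is still alive
                }
            } catch (java.sql.SQLException e) {
                // rollback failed: the connection really is unusable, fall through to close
            }
            try {
                connection.close(); // physically destroy the connection
            } catch (java.sql.SQLException ignored) {
            }
        }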

        Jack Cai made changes -
        Attachment: connector.patch
        Attachment: vendors.patch
        Attachment: tranql-connector-mysql-local-1.3-SNAPSHOT.rar
        Jack Cai added a comment -

        I've completed the patch that leverages the PooledConnection interface, so I'm posting it here for initial review.

        The connector.patch is for the tranql-connector trunk, and the vendors.patch contains updates to the mysql, derby and pg connectors (for convenience, I created the patch from the vendor root.)

        I'm uploading the built RAR packages, so hopefully somebody can help test the new connectors. To test them with an existing Geronimo build, you need to rename the RAR package to match the version used in that Geronimo build.

        Radim Kolar added a comment -

        Before returning a connection from the client back to the pool, the connection manager is expected to call rollback() on it, isn't it? The best approach would be: if the rollback fails, drop the connection.

        Jack Cai made changes -
        Assignee: Jack Cai
        David Jencks added a comment -

        Absolutely the best solution for this kind of problem is to wrap a PooledDatasource rather than a driver. I couldn't find any when I started on tranql. You should be able to make an AbstractPooledDatasourceMCF similar to the AbstractXADatasourceMCF but using local tx.

        I don't think that only rejecting a few known exceptions is necessarily a good idea. For the generic ra I implemented a new exception sorter that allows exceptions with an SQLCode from a list. However, I suspect the list of non-fatal SQLCodes could be a lot more comprehensive. Can anyone find the list in the ANSI or X/Open specs?

        Forrest Xia added a comment -

        I verified that the modified "tranql-connector-postgresql-common-1.1.jar" works for tranql's PostgreSQL local adapter. Once the db error is fixed, the Geronimo server can rebuild connections to the PostgreSQL database.

        Jack Cai made changes -
        Attachment: PGtrial.patch
        Jack Cai added a comment -

        Attaching a modified trial patch that hard-codes the fatal error codes.

        Even though this might fix the problem with PostgreSQL, it's far from the final solution.

        I realize that tranql does not take advantage of the PooledDatasource/PooledConnection implementations that are available in most existing DBMS JDBC drivers. The PooledConnection can notify a listener when a fatal error has occurred and the connection should be discarded. I'd suggest that we improve tranql to use PooledDatasource, instead of relying on a customized ExceptionSorter that we have to write ourselves for every DBMS.
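
        To illustrate the mechanism (standard javax.sql API; the method wrapper and comments are illustrative, and cpds stands for a vendor's ConnectionPoolDataSource):

        void demo(javax.sql.ConnectionPoolDataSource cpds) throws java.sql.SQLException {
            javax.sql.PooledConnection pc = cpds.getPooledConnection();
            pc.addConnectionEventListener(new javax.sql.ConnectionEventListener() {
                public void connectionClosed(javax.sql.ConnectionEvent event) {
                    // logical handle closed normally; the physical connection can go back to the pool
                }
                public void connectionErrorOccurred(javax.sql.ConnectionEvent event) {
                    // the driver says the physical connection is dead; remove and destroy it
                }
            });
            java.sql.Connection handle = pc.getConnection(); // logical handle handed to the application
        }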

        What do others think?

        David Jencks added a comment -

        Could you show the code you are using?

        I'm not entirely sure that a property file is easier to update than Java code. You still have to repack everything and make sure you have it in the right place and the right copy... it seems to me that rebuilding the project might be easier, and it is certainly much friendlier for tracking what code you are using.

        So, without more discussion I'd be much more in favor of not having a configurable exception sorter but just getting it right for postgres.

        Jack Cai added a comment -

        As David pointed out, there's no standard way to tell that an SQLException is fatal. So it might be hard to make the generic adapter detect dead connections.

        I went ahead and wrote an exception sorter that regards the following SQL states as fatal for PostgreSQL. The list of fatal errors can be configured through the property file packaged in the attached jar. Does anybody want to test it with Geronimo (simply replace the one in the Geronimo repo with this jar)?

        CONNECTION_UNABLE_TO_CONNECT: 08001
        CONNECTION_DOES_NOT_EXIST: 08003
        CONNECTION_REJECTED: 08004
        CONNECTION_FAILURE: 08006
        CONNECTION_FAILURE_DURING_TRANSACTION: 08007
        PROTOCOL_VIOLATION: 08P01
        COMMUNICATION_ERROR: 08S01
        SYSTEM_ERROR: 60000
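
        A minimal sketch of such a sorter (illustrative; the class name is hypothetical and only mirrors the ExceptionSorter idea, it is not the attached patch):

        public class PGFatalStateSorter {
            private static final java.util.Set<String> FATAL_STATES = new java.util.HashSet<String>(
                    java.util.Arrays.asList("08001", "08003", "08004", "08006",
                                            "08007", "08P01", "08S01", "60000"));

            // Decide whether the connection that produced this exception must be discarded.
            public boolean isExceptionFatal(Exception e) {
                if (e instanceof java.sql.SQLException) {
                    String state = ((java.sql.SQLException) e).getSQLState();
                    return state != null && FATAL_STATES.contains(state);
                }
                return false;
            }
        }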

        Radim Kolar added a comment -

        Can you get this bug fixed? It's a major reliability problem.

        Tomcat uses Jakarta Commons DBCP and does not suffer from this problem. Is there a way to use DBCP in Geronimo?
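
        For reference, the DBCP connection validation being alluded to looks roughly like this (Commons DBCP 1.x API; the driver, URL, and query are illustrative):

        javax.sql.DataSource dbcpExample() {
            org.apache.commons.dbcp.BasicDataSource ds = new org.apache.commons.dbcp.BasicDataSource();
            ds.setDriverClassName("org.postgresql.Driver");
            ds.setUrl("jdbc:postgresql://localhost/mydb");
            ds.setValidationQuery("SELECT 1"); // run before a connection is handed to the app
            ds.setTestOnBorrow(true);          // a connection failing validation is dropped, not reused
            return ds;
        }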

        Forrest Xia added a comment -

        More findings about the db connection regeneration problem:
        1. the generic tranql adapter has the problem (except when the db vendor is DB2)
        2. tranql-connector-mysql-local has the problem

        So I suspect that all of the vendor local adapters have the problem.

        Forrest Xia added a comment -

        I did some testing on this issue and conclude that:
        1. the tranql-connector-postgresql-xa adapter has no such problem
        2. tranql-connector-postgresql-local does have the problem

        Besides the exception described by Radim Kolar, additional exceptions are:
        Caused by: java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:103)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:147)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:76)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:134)
        at org.postgresql.core.PGStream.flush(PGStream.java:508)
        at org.postgresql.core.v3.QueryExecutorImpl.sendSync(QueryExecutorImpl.java:676)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
        ... 62 more

        It seems that with a local-transaction db pool for PostgreSQL, the bad connection won't be thrown out; on subsequent db requests, the server still returns the bad connection to the application.

        But an XA-transaction db pool for PostgreSQL has no such problem.

        David Jencks made changes -
        Fix Version/s: Wish List
        Fix Version/s: 2.2
        Fix Version/s: 2.1.5
        David Jencks added a comment -

        If someone were to investigate this issue, they would need to look into whether some of these are not working:

        • in geronimo connection management, if a fatal error occurs then the managed connection should get destroyed and removed from the pool. The user app will get an exception, and it should do something to deal with it. For instance, it could ask the user to retry later, or it could try again with a new connection.
        • The tranql postgres wrapper needs to tell geronimo that a fatal error occurred by calling (IIRC) connectionErrorOccurred with an event (see the sketch after this list). If this is not happening, then the tranql wrapper needs to be taught how to recognize fatal errors.
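
        A sketch of that notification path (javax.resource.spi API; the method and its parameters are illustrative stand-ins for fields of the tranql ManagedConnection):

        void notifyFatalError(javax.resource.spi.ManagedConnection mc,
                              java.util.List<javax.resource.spi.ConnectionEventListener> listeners,
                              Exception cause) {
            javax.resource.spi.ConnectionEvent event = new javax.resource.spi.ConnectionEvent(
                    mc, javax.resource.spi.ConnectionEvent.CONNECTION_ERROR_OCCURRED, cause);
            for (javax.resource.spi.ConnectionEventListener listener : listeners) {
                listener.connectionErrorOccurred(event); // the container then destroys the connection
            }
        }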

        – Note also that anyone using the generic wrapper won't get very reasonable behavior, because there is no standard way to tell that an SQLException means the connection is dead.

        I don't see anyone fixing this right now, so it won't be in 2.2.

        Jarek Gawor made changes -
        Fix Version/s: 2.1.5
        Fix Version/s: 2.1.4
        Affects Version/s: 2.1.4
        Jarek Gawor added a comment -

        Updated affects/fix versions as this won't get fixed in time for 2.1.4.

        Donald Woods made changes -
        Affects Version/s: 2.1.3
        Fix Version/s: Wish List
        Fix Version/s: 2.1.4
        Fix Version/s: 2.2
        Donald Woods added a comment -

        OK, so there is still one failing scenario (it could be related to the PostgreSQL TranQL connector...)

        Radim Kolar added a comment -

        I did testing on PostgreSQL 8.2 and WAS CE 2.1.1.1. After PostgreSQL is restarted the db pool is unusable; there is no hang, but Geronimo never recovers from the error. Trace dumps look like this:

        javax.servlet.ServletException: org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
        org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:852)
        org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:781)
        org.apache.jsp.index_jsp._jspService(index_jsp.java:535)
        org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:806)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:369)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:806)

        The same application running in a Tomcat 5.5.25 container fails only once after PostgreSQL is restarted; the following requests are fine. Probably the Tomcat db pool can detect failed connections to the database and close them instead of recycling them to the application.

        David Frahm added a comment -

        FWIW, upgrading WASCE seems to have resolved the connection issues we were having.

        We installed a new physical server, keeping RHEL 5.2 but using WASCE 2.1.0.1 this time. It's been running for weeks without a single connection issue. As commented above, the previous server would have hung connections 2 or more times per week.

        Thanks everyone!

        Donald Woods made changes -
        Fix Version/s: 2.2
        Fix Version/s: 2.1.4
        Affects Version/s: 2.1.1
        Affects Version/s: 2.1.2
        Assignee: Donald Woods
        Fix Version/s: 2.0.4
        Affects Version/s: 2.1
        Affects Version/s: 2.2
        Fix Version/s: Wish List
        Affects Version/s: 2.1.4
        Donald Woods added a comment -

        The user reported that there is no hang. The lost connections are expected unless you are using something like Oracle RAC, which provides failover at the JDBC driver level. Unassigning in case anyone else wants to look into this, but Geronimo 2.0.3-SNAPSHOT, 2.1.x, and 2.2 include the updated txmanager fixes required to enable failover support in the TranQL connectors.

        Jay D. McHugh made changes -
        Fix Version/s: 2.0.4
        Fix Version/s: 2.0.3
        Donald Woods made changes -
        Fix Version/s: 2.1
        Donald Woods made changes -
        Affects Version/s: 2.1
        Assignee: Donald Woods
        Affects Version/s: 2.1.4
        Affects Version/s: 2.1.2
        Affects Version/s: 2.1.1
        Affects Version/s: 2.2
        Donald Woods made changes -
        Fix Version/s: 2.2
        Fix Version/s: 2.1.4
        Donald Woods made changes -
        Status: Resolved → Reopened
        Resolution: Duplicate
        Radim Kolar added a comment -

        This issue is still not fixed. I can reproduce it easily just by restarting the database (I tested it on an app using PostgreSQL, but it will most likely do the same on MySQL, as in the old G version). I tested it with Geronimo 2.1.1.

        After the db is restarted, all connections still fail, but I must correct my previous comment: G 2.1.1 doesn't hang.

        Donald Woods made changes -
        Status: Open → Resolved
        Resolution: Duplicate
        Fix Version/s: 2.0.3
        Fix Version/s: 2.1
        Donald Woods added a comment -

        This should be a duplicate of GERONIMO-3834.
        Please verify against the Geronimo 2.0.3-SNAPSHOT or latest 2.1.x release and then close or reopen if you can still recreate it.

        Kevan Miller added a comment -

        Hi David,
        The testing I performed was totally manual.

        However, we did add additional test cases as the Connector problems were being identified, diagnosed and fixed.

        Have you had a chance to test your scenario using Geronimo 2.1.2? I was waiting to hear back before closing this Jira out.

        David Frahm added a comment -

        Thank you for the test results. Is that test, or some other test, now added to an automated build test suite or part of regression testing in some way?

        Kevan Miller added a comment -

        I tried to recreate this using the Roller plugin and MySQL (on 2.1.2). I drove DB connection errors by stopping/restarting the MySQL server. The Connection Pool seemed to be working properly. Driving an idle timeout was a little tricky because of DB activity that Roller initiates. However, that too seemed to be working properly.

        As Donald mentions, I believe this is probably a bug in geronimo-connector – which has been fixed. If you could test your environments with this code, it would be greatly appreciated.

        Thad West added a comment - edited

        The Derby test case outlined by David sounds like a good idea. However, I don't think the DB should be deleted.

        The problem, as I see it, is how the connection pool responds to stale connections. If the DB is deleted, then I see no reasonable alternative except for the pool to throw an exception.

        However, if there is a connection timeout setting in Derby (like the one I outlined for MySQL... see my first comment), that would be the more accurate test. The database still needs to be available, but the connections in the pool should be invalidated by the DB engine. The Geronimo pool should gracefully recover from the stale connections and get another connection.

        David Frahm added a comment -

        I will see about trying a newer WASCE. Might not happen very soon though.

        I was trying to think about how to create a test case for this that doesn't involve our production as/400. With other people having this issue, maybe we could use a different database.

        Certainly MySQL would be better, but what about something like Derby for a unit test? That could be embedded and therefore run with every build. Maybe the test could (see the sketch after this list):

        1. initialize Derby, create the database, a table, some data
        2. use the connection pool to access the data
        3. take down Derby (even delete the database?)
        4. repeat step 1
        5. test the connection pool
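
        A hypothetical sketch of that test (the class, the injected dataSource, and the query are illustrative; the shutdown/boot URLs are standard embedded-Derby behavior):

        public class PoolRecoveryTest {

            private void useConnection(javax.sql.DataSource dataSource) throws java.sql.SQLException {
                java.sql.Connection c = dataSource.getConnection();
                try {
                    c.createStatement().executeQuery("VALUES 1").close(); // touch the database
                } finally {
                    c.close();
                }
            }

            public void testPoolSurvivesRestart(javax.sql.DataSource dataSource) throws Exception {
                Class.forName("org.apache.derby.jdbc.EmbeddedDriver");                  // step 1: boot Derby
                java.sql.DriverManager.getConnection("jdbc:derby:testdb;create=true").close();
                useConnection(dataSource);                                               // step 2: pool works
                try {
                    java.sql.DriverManager.getConnection("jdbc:derby:;shutdown=true");  // step 3: take Derby down
                } catch (java.sql.SQLException expected) {
                    // Derby signals a clean engine shutdown with SQLState XJ015
                }
                Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();    // step 4: reload the driver
                java.sql.DriverManager.getConnection("jdbc:derby:testdb").close();
                useConnection(dataSource);                                               // step 5: the pool must recover
            }
        }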

        I'm not sure which would be best, trying to debug the production issues or trying to create a repro test case. I guess I'll keep thinking on both and see which one yields some results.

        Donald Woods added a comment -

        Is there any way you can try recreating this on WASCE 2.0.0.2/2.1.0.0 or Geronimo 2.1.2?
        The WASCE 2.0.0.2 release includes additional txmanager changes that are in Geronimo 2.0.3-SNAPSHOT and the Geronimo 2.1.x releases and should produce different results...

        David Frahm made changes -
        Attachment: before and after wasce restart.txt
        David Frahm added a comment - edited

        This log shows the connection errors, then a server restart, after which everything is happy again.

        It's not the 'called with null' error that I was remembering, but that might be because of one of the following:

        1.) The server was upgraded from 1.x to 2.x around that time, so maybe the error changed
        2.) I think the 'top-level' error is a bit different when caused by one of my JSF components; the errors in this attachment are from a standard j_security_check form authentication.

        David Frahm added a comment -

        I've never seen that error, sorry.

        I'm not where I can get at the logs right now, but I do have some old notes regarding this error we would get until we restarted the server:

        WARN [GeronimoConnectionEventListener] connectionErrorOccurred called with null

        Kevan Miller added a comment -

        Trying to get some time to investigate this. Maybe tonight?

        Do you see anything suspicious in your logs? Maybe something like:

        "Error occurred during execution of ExpirationMonitor TimerTask"

        Radim Kolar added a comment -

        We have the same problems; they can be reproduced very easily, for example by restarting the database server, and I agree with the poster that WebSphere 6 doesn't have this problem. Tomcat 5.5 has the same problem. It's a serious problem and needs to be fixed.

        Also, G hangs if the connection pool is full of invalid connections, until they hit the idle timeout.

        Thad West made changes -
        Attachment: stacktrace.txt
        Thad West added a comment -

        And by Kablammo, I mean: (see stacktrace.txt)

        Thad West added a comment - edited

        We are running Geronimo 2.0.2 connecting to MySQL v5.

        The default for MySQL is to close a connection after 8 hours of inactivity (overnight for us). Here are the details:
        http://dev.mysql.com/doc/refman/5.0/en/gone-away.html

        For testing purposes, you could change the time limit by setting the wait_timeout variable when you start MySQL.
        http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#option_mysqld_wait_timeout
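
        For example (the value is illustrative; SET GLOBAL affects sessions opened after the change):

        -- shrink the idle timeout from the 8-hour default to 60 seconds for testing
        SET GLOBAL wait_timeout = 60;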

        So...MySQL invalidates the connection, but Geronimo keeps it in the pool. As soon as you get one of these connections and try to use it, kablammo!

        David Frahm made changes -
        Environment: "RedHat Enterprise Linux Server v5.2 / WAS-CE v2.0.0.1, based on Geronimo v2.0.2" → "Red Hat Enterprise Linux Server v5.2 / WAS-CE v2.0.0.1, based on Geronimo v2.0.2"
        David Frahm created issue -

          People

          • Assignee: Jack Cai
          • Reporter: David Frahm
          • Votes: 3
          • Watchers: 5

            Dates

            • Created:
            • Updated:
            • Resolved:
