Issue Details (XML | Word | Printable)

Key: DERBY-3596
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Kristian Waagan
Reporter: Kristian Waagan
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby

Creation of logical connections from a pooled connection causes resource leak on the server

Created: 04/Apr/08 02:11 PM   Updated: 29/Jun/09 01:47 PM
Return to search
Component/s: Network Client, Network Server
Affects Version/s: 10.1.3.1, 10.2.2.0, 10.3.2.1, 10.4.1.3, 10.5.1.1
Fix Version/s: 10.4.2.0, 10.5.1.1

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works ConnectionPoolingBug.java 2008-04-04 02:14 PM Kristian Waagan 3 kB
File Licensed for inclusion in ASF works derby-3596-1a-complex_approach.diff 2008-05-21 02:39 PM Kristian Waagan 9 kB
File Licensed for inclusion in ASF works derby-3596-1b-complex_approach.diff 2008-05-28 11:28 AM Kristian Waagan 5 kB
File Licensed for inclusion in ASF works derby-3596-2a-simple_approach.diff 2008-05-21 02:39 PM Kristian Waagan 0.6 kB
File Licensed for inclusion in ASF works derby-3596-3a-test_cleanup.diff 2008-05-26 11:50 AM Kristian Waagan 3 kB
File Licensed for inclusion in ASF works derby-3596-4a-complex_check_creds.diff 2008-06-04 11:26 AM Kristian Waagan 10 kB
File Licensed for inclusion in ASF works derby-3596-5a-complex_skip_creds.diff 2008-06-10 12:55 PM Kristian Waagan 8 kB
File Licensed for inclusion in ASF works derby-3596-5a-complex_skip_creds.diff 2008-06-04 11:26 AM Kristian Waagan 8 kB
Image Attachments:

1. complex-fix-heap.png
(73 kB)

2. nofix-heap.png
(63 kB)

3. simple-fix-heap.png
(75 kB)
Issue Links:
Reference
 

Bug behavior facts: Performance
Resolution Date: 30/Jun/08 07:42 AM


 Description  « Hide
When using ClientConnectionPoolDataSource and connection pooling, a new connection / transaction is created for every new logical connection, and the resources are not freed / cleaned up in the server. They are not even cleaned up when the physical connection (ClientPooledConnection) is closed.
A logical connection is obtained by invoking ClientPooledConnection.getConnection().

I have observed that if you run the repro enough times against the same server, the number of transaction in the transaction table will be reduced now and then. I believe this is garbage collection, but I have not investigated the problem enough to tell for sure what's going on.

I have also seen some locks not being freed, causing timeouts in some applications. I don't have a repro for the lock problem at this time, but it is very likely related to this issue.

Note that XA connections are handled differently on the server, and do probably not have this problem.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Kristian Waagan added a comment - 04/Apr/08 02:12 PM
Corrected messed up affects and fix version...

Kristian Waagan added a comment - 04/Apr/08 02:14 PM
Attached repro 'ConnectionPoolingBug.java'.
It runs a test sequence and prints the number of entries in the transaction table.
You must start a network server manually, and run the repro multiple times to see the GC happening on the server.

Dyre Tjeldvoll added a comment - 04/Apr/08 02:31 PM
My understanding is that the statement cache is only available when using ClientConnectionPoolDataSource. Perhaps we should include a reference to this issue in the description of the statement cache feature in the release notes? Seems like anyone wanting to try out the statement cache will likely experience this problem.

Knut Anders Hatlen added a comment - 04/Apr/08 10:44 PM
Isn't this the same problem as the one described in DERBY-3319?

Knut Anders Hatlen added a comment - 07/Apr/08 07:19 AM
Sorry, I didn't notice that setAutoCommit(false) was commented out, so this is probably not the same issue as DERBY-3319.

Kristian Waagan added a comment - 21/May/08 02:39 PM
I have uploaded two fixes that solve the problem.
None of them are ready for commit, but I'd like to get some comments
on which approach to go for.

* 'derby-3596-1a-complex_approach.diff'
  Solves the problem by maintaining only one physical connection on the
  server side. This connection has to be reset, which includes setting
  the isolation level, resetting IDENTITY_VAL_LOCAL and cleaning out
  temporary tables.
  The patch basically contains two parts: Detecting and handling a
  deferred reset in DRDAConnThread and exposing / extending reset
  functionality (the connection classes).
  A drawback of this approach is that care must be taken to get the
  reset logic correct.

* 'derby-3596-2a-simple_approach.diff'
  Solves the problem by simply closing the existing connection on the
  server side before creating a new, if there is one.
  Only a few lines of code.
  This solution is far from optimal, since a physical connection is both
  closed and created on the server for each logical connection on the
  client side.

* Optimal approach
  This would probably be to rewrite the whole logic, both on the client
  and the server side. It seems to me there is half-baked support for a
  lot more advanced functionality than what we are using. I see what we
  have / had as a first step towards supporting enterprise multi-tiered
  connection pooling ala what DB2 has. There is support for this in
  DRDA, where for instance a pooling agent can use a single connection
  to do work for multiple clients/drivers (one must switch database and
  user "on the fly").

I'll attach a few graphs showing the heap usage of the various
approaches, and also provide a few numbers regarding performance.

I'm asking for comments on this issue!

Kristian Waagan added a comment - 21/May/08 02:50 PM
Attaching three images showing the heap usage of the various builds: no fix, simple fix and complex fix.
Note the left side axis, it is not the same across all the images!

I used the attached repro to measure performance, and JConsole to monitor heap usage. I ran the repro 10 times, each run lasting 30 seconds using Sun JDK 1.6.0 with default settings on my dual core machine with 4 GB memory.

Number of iterations done by the repro;
 - no fix: 48 000 [25 - 56]
 - simple fix: 78 000 [70 - 79]
 - complex fix: 125 000 [120 - 128]

Note that the complex fix is not optimal, but feasible without changing
the client driver as well. I haven't verified, but I believe it should
work fine with older client versions. The simple fix keeps the semantics
unchanged.

Kristian Waagan added a comment - 21/May/08 02:53 PM
Forgot to say, Derby quickly gets into trouble running real load without any of the fixes.
With all these dangling transactions, lock conflicts quickly accumulates...

Kathey Marsden added a comment - 22/May/08 08:28 PM
I am about even on the two options. The simple fix seems less risky and less likely to have bugs, but I don't know if the complex fix has other performance benefits. Just a nit on the fix. It would be good to mention the bug number in the comments where you talk about the bug, so future readers can refer to this conversation.

Kristian Waagan added a comment - 23/May/08 06:59 AM
Thanks for looking at the issue Kathey.
I'll include the Jira number in the relevant comments when I post an updated patch.

I agree that the simple fix seems less risky. I get one error when I run the regression tests with it. The failure is in one of the new tests I wrote for JDBC statement pooling, so I have to look into the test and also check how statement pooling is affected by the fix.

I do have some reservations about the more complex fix. Even though the approach is more optimal, the implementation is a bit hacky due to compatibility constraints. I wonder if it would be better to design and implement a proper mechanism at a later time.
Will we ever roll a compatibility breaking release? If not, we have to code this using version checking and keep multiple implementations on both the client and server side.

Also, please note that there are more alternatives than the two I have suggested here. For instance, today we have Database and XADatabase. It might be possible to introduce a PoolingDatabase, taken we can detect when clients are indeed working with pooled connections. The difference between this one and the current complex fix, would be that the former would map client logical connections onto embedded logical connections, whereas the latter would map client logical connections directly onto a single embedded physical connection. This has some implications regarding connection state reset and object creation.


I'll work a bit more on the test failure and getting a better grip on things.
Feel free to add more comments.

Knut Anders Hatlen added a comment - 23/May/08 10:31 AM
Regardless of which fix we choose, I think the changes to EmbedConnection.resetFromPool() should be part of it. I tried to set the schema on a logical connection with the embedded driver, and when I closed that connection and opened a new logical connection, the current schema had not been reset. (We may consider pushing the resetting all the way into LCC.resetFromPool(), by the way.)

Knut Anders Hatlen added a comment - 23/May/08 11:30 AM
I logged the problem with the embedded driver as DERBY-3690 so that we can address it separately.

Kristian Waagan added a comment - 26/May/08 11:50 AM
'derby-3596-3a-test_cleanup.diff' contains minor cleanups in a single test in StatementPoolingTest;
 - renamed test (added the word 'Physical')
 - added a missing fail()
 - changed some comments
 - added a constant for a SQL statement

Committed to trunk with revision 660165 and the 10.4 branch with revision 660168.

Kristian Waagan added a comment - 28/May/08 11:28 AM
'derby-3596-1b-complex_approach.diff' removes (now) redundant code (added under DERBY-3690) and cleans up the patch a little bit. It is actually pretty simple, and I'm leaning towards implementing this approach instead of the one I have called simple.

The simple one causes some test errors, because of some internal state on the server gets inconsistent.
Since the complex one is not that complex and gives better performance, I'm strongly considering choosing it over the other solution.

I would be grateful if someone could have a look at the patch and comment on it (again) or ask questions.
FYI, the client sends a "reconnect sequence" every time a logical connection is created on the client; EXCSAT, ACCSEC, SECCHK and ACCRDB.
This is probably not the last spin, I have to check if there are other connection properties that have to be reset for logical connection.

Knut Anders Hatlen added a comment - 28/May/08 01:01 PM
- I don't know if this can ever happen, but since we always set
deferredReset to false in parseEXCSAT() if appRequester is null, I
assume that it is possible that deferredReset is true when
parseEXCSAT() is called. As the code is now, it won't set
deferredReset to false if it's true when the method is called and it
is an XA request. Is this intended, or should deferredReset always be
set to the value of (appRequester != null && !appRequester.isXARequester())?


- I'm not sure I understand this comment:

+ // Reset the flag again. In sane builds it is used to avoid asserts, but
+ // we want to reset it as soon as possible to avoid masking real bugs.
+ this.deferredReset = false;

I don't see any asserts that check deferredReset, and I don't see how
it masks bugs.

- Is it safe to skip the user/password check at the end of
parseSECCHK() when processing a deferred reset? Earlier in that
method, the user id and password fields of the database object are
updated with whichever the SECCHK message contains, so even if the
Derby network client never changes the user id, could a malicious
client exploit this somehow?

Kristian Waagan added a comment - 28/May/08 03:06 PM
- 1) I moved the reset to before "if (appRequester != null)". It should be true for the shortest time possible, but it must survive for a few iterations in the processCommands loop after it has been set to true.

- 2) Sorry, I was at bit unclear. If you look at the end of the method processCommands, you see this inside a SanityManager.DEBUG block:

    if (!this.deferredReset && pbsd != null) {
        ...
        SanityManager.ASSERT(!pbsd.isModified(),

    This assert fails because of the connection reset I do. When the PBSD is written, we're on track again and the flag can be reset and we execute the assert again for later iterations in the processCommands loop. This workaround is required due to the "hackyness" of the fix...

- 3) No, I don't believe it is entirely safe to skip the user/password check. However, that code is what creates the new connections, which is what we want to avoid. The credentials are verified by actually connecting to the database. Some options:
    a) Make an extra connection attempt to verify credentials.
    b) Hang on to the credentials made during the initial connection and see if they have changed for the deferred resets.

    Note that our driver had the functionality to change the credentials, but it doesn't anymore. So I suppose the ability to exploit this depends on if a malicious client can "take over the TCP-IP pipe" and act as the initial user. The new user credential won't take effect.
    For the general case, this will cause some overhead, and it is my opinion that the driver shouldn't send this "reconnection sequence" at all. I'll try to implement option b) in the next patch, along with some more comments and references to the Jira issue.

Kristian Waagan added a comment - 04/Jun/08 11:26 AM
Patches 4a and 5a implements checking the credentials or skipping the SECCHK, respectively.
Note the change that the database object is not reset if we are dealing with a deferred reset (creating a new logical connection on the client).

I tried three variants of the skip approach:
 - use reader.skipDss().
 - use while loop and reader.skipBytes(),but don't check anything.
 - use while loop and make sure no invalid code points are received (uploaded patch 5a).

I ran the regression tests with the check creds patch (4a) and the two first skip approaches. Tests for the 5a patch is running now.

Opinions?
If I hear nothing, i think I'll go for patch 5a.

Kristian Waagan added a comment - 04/Jun/08 05:25 PM
Regression tests completed without failures with patch 5a applied (Sun JDK 1.6, Solaris 10).

Knut Anders Hatlen added a comment - 10/Jun/08 09:56 AM
5a looks good to me. I agree that it's better to keep this code simple and that 5a is preferable to 4a.

Kristian Waagan added a comment - 10/Jun/08 12:54 PM
Thank you for looking at the new patches Knut Anders.

Committed 'derby-3596-5a-complex_skip_creds.diff' to trunk with revision 666088.
This must be backported to 10.4, but I want to wait for a while to see if any issues are detected.

I think this patch concludes the work on the issue unless new problems surface. Note that I discovered a performance issue that I will handle under DERBY-3441. The client driver reset procedure causes the query "VALUES CURRENT SCHEMA" to be executed on every connection reset. I will post more details under the mentioned Jira issue.

Kristian Waagan added a comment - 10/Jun/08 12:55 PM
(Uploaded new 5a patch file where tabs on a single line are removed.)

Kristian Waagan added a comment - 30/Jun/08 07:42 AM
Merged fix 5a to 10.4 with revision 672719.
Regression tests ran without failures on the 10.4 branch.

Kristian Waagan added a comment - 30/Jun/08 07:45 AM
Updated fix version and also ticked the performance category.
Before this fix, resources would leak and garbage collection would cause severe (periodic) performance degradation.

Kristian Waagan added a comment - 04/Aug/08 07:33 AM
Closing, no issues reported.