|
[
Permlink
| « Hide
]
Kristian Waagan added a comment - 04/Apr/08 02:12 PM
Corrected messed up affects and fix version...
Attached repro 'ConnectionPoolingBug.java'.
It runs a test sequence and prints the number of entries in the transaction table. You must start a network server manually, and run the repro multiple times to see the GC happening on the server. My understanding is that the statement cache is only available when using ClientConnectionPoolDataSource. Perhaps we should include a reference to this issue in the description of the statement cache feature in the release notes? Seems like anyone wanting to try out the statement cache will likely experience this problem.
Isn't this the same problem as the one described in
Sorry, I didn't notice that setAutoCommit(false) was commented out, so this is probably not the same issue as
I have uploaded two fixes that solve the problem.
None of them are ready for commit, but I'd like to get some comments on which approach to go for. * 'derby-3596-1a-complex_approach.diff' Solves the problem by maintaining only one physical connection on the server side. This connection has to be reset, which includes setting the isolation level, resetting IDENTITY_VAL_LOCAL and cleaning out temporary tables. The patch basically contains two parts: Detecting and handling a deferred reset in DRDAConnThread and exposing / extending reset functionality (the connection classes). A drawback of this approach is that care must be taken to get the reset logic correct. * 'derby-3596-2a-simple_approach.diff' Solves the problem by simply closing the existing connection on the server side before creating a new, if there is one. Only a few lines of code. This solution is far from optimal, since a physical connection is both closed and created on the server for each logical connection on the client side. * Optimal approach This would probably be to rewrite the whole logic, both on the client and the server side. It seems to me there is half-baked support for a lot more advanced functionality than what we are using. I see what we have / had as a first step towards supporting enterprise multi-tiered connection pooling ala what DB2 has. There is support for this in DRDA, where for instance a pooling agent can use a single connection to do work for multiple clients/drivers (one must switch database and user "on the fly"). I'll attach a few graphs showing the heap usage of the various approaches, and also provide a few numbers regarding performance. I'm asking for comments on this issue! Attaching three images showing the heap usage of the various builds: no fix, simple fix and complex fix.
Note the left side axis, it is not the same across all the images! I used the attached repro to measure performance, and JConsole to monitor heap usage. I ran the repro 10 times, each run lasting 30 seconds using Sun JDK 1.6.0 with default settings on my dual core machine with 4 GB memory. Number of iterations done by the repro; - no fix: 48 000 [25 - 56] - simple fix: 78 000 [70 - 79] - complex fix: 125 000 [120 - 128] Note that the complex fix is not optimal, but feasible without changing the client driver as well. I haven't verified, but I believe it should work fine with older client versions. The simple fix keeps the semantics unchanged. Forgot to say, Derby quickly gets into trouble running real load without any of the fixes.
With all these dangling transactions, lock conflicts quickly accumulates... I am about even on the two options. The simple fix seems less risky and less likely to have bugs, but I don't know if the complex fix has other performance benefits. Just a nit on the fix. It would be good to mention the bug number in the comments where you talk about the bug, so future readers can refer to this conversation.
Thanks for looking at the issue Kathey.
I'll include the Jira number in the relevant comments when I post an updated patch. I agree that the simple fix seems less risky. I get one error when I run the regression tests with it. The failure is in one of the new tests I wrote for JDBC statement pooling, so I have to look into the test and also check how statement pooling is affected by the fix. I do have some reservations about the more complex fix. Even though the approach is more optimal, the implementation is a bit hacky due to compatibility constraints. I wonder if it would be better to design and implement a proper mechanism at a later time. Will we ever roll a compatibility breaking release? If not, we have to code this using version checking and keep multiple implementations on both the client and server side. Also, please note that there are more alternatives than the two I have suggested here. For instance, today we have Database and XADatabase. It might be possible to introduce a PoolingDatabase, taken we can detect when clients are indeed working with pooled connections. The difference between this one and the current complex fix, would be that the former would map client logical connections onto embedded logical connections, whereas the latter would map client logical connections directly onto a single embedded physical connection. This has some implications regarding connection state reset and object creation. I'll work a bit more on the test failure and getting a better grip on things. Feel free to add more comments. Regardless of which fix we choose, I think the changes to EmbedConnection.resetFromPool() should be part of it. I tried to set the schema on a logical connection with the embedded driver, and when I closed that connection and opened a new logical connection, the current schema had not been reset. (We may consider pushing the resetting all the way into LCC.resetFromPool(), by the way.)
I logged the problem with the embedded driver as
'derby-3596-3a-test_cleanup.diff' contains minor cleanups in a single test in StatementPoolingTest;
- renamed test (added the word 'Physical') - added a missing fail() - changed some comments - added a constant for a SQL statement Committed to trunk with revision 660165 and the 10.4 branch with revision 660168. 'derby-3596-1b-complex_approach.diff' removes (now) redundant code (added under
The simple one causes some test errors, because of some internal state on the server gets inconsistent. Since the complex one is not that complex and gives better performance, I'm strongly considering choosing it over the other solution. I would be grateful if someone could have a look at the patch and comment on it (again) or ask questions. FYI, the client sends a "reconnect sequence" every time a logical connection is created on the client; EXCSAT, ACCSEC, SECCHK and ACCRDB. This is probably not the last spin, I have to check if there are other connection properties that have to be reset for logical connection. - I don't know if this can ever happen, but since we always set
deferredReset to false in parseEXCSAT() if appRequester is null, I assume that it is possible that deferredReset is true when parseEXCSAT() is called. As the code is now, it won't set deferredReset to false if it's true when the method is called and it is an XA request. Is this intended, or should deferredReset always be set to the value of (appRequester != null && !appRequester.isXARequester())? - I'm not sure I understand this comment: + // Reset the flag again. In sane builds it is used to avoid asserts, but + // we want to reset it as soon as possible to avoid masking real bugs. + this.deferredReset = false; I don't see any asserts that check deferredReset, and I don't see how it masks bugs. - Is it safe to skip the user/password check at the end of parseSECCHK() when processing a deferred reset? Earlier in that method, the user id and password fields of the database object are updated with whichever the SECCHK message contains, so even if the Derby network client never changes the user id, could a malicious client exploit this somehow? - 1) I moved the reset to before "if (appRequester != null)". It should be true for the shortest time possible, but it must survive for a few iterations in the processCommands loop after it has been set to true.
- 2) Sorry, I was at bit unclear. If you look at the end of the method processCommands, you see this inside a SanityManager.DEBUG block: if (!this.deferredReset && pbsd != null) { ... SanityManager.ASSERT(!pbsd.isModified(), This assert fails because of the connection reset I do. When the PBSD is written, we're on track again and the flag can be reset and we execute the assert again for later iterations in the processCommands loop. This workaround is required due to the "hackyness" of the fix... - 3) No, I don't believe it is entirely safe to skip the user/password check. However, that code is what creates the new connections, which is what we want to avoid. The credentials are verified by actually connecting to the database. Some options: a) Make an extra connection attempt to verify credentials. b) Hang on to the credentials made during the initial connection and see if they have changed for the deferred resets. Note that our driver had the functionality to change the credentials, but it doesn't anymore. So I suppose the ability to exploit this depends on if a malicious client can "take over the TCP-IP pipe" and act as the initial user. The new user credential won't take effect. For the general case, this will cause some overhead, and it is my opinion that the driver shouldn't send this "reconnection sequence" at all. I'll try to implement option b) in the next patch, along with some more comments and references to the Jira issue. Patches 4a and 5a implements checking the credentials or skipping the SECCHK, respectively.
Note the change that the database object is not reset if we are dealing with a deferred reset (creating a new logical connection on the client). I tried three variants of the skip approach: - use reader.skipDss(). - use while loop and reader.skipBytes(),but don't check anything. - use while loop and make sure no invalid code points are received (uploaded patch 5a). I ran the regression tests with the check creds patch (4a) and the two first skip approaches. Tests for the 5a patch is running now. Opinions? If I hear nothing, i think I'll go for patch 5a. Regression tests completed without failures with patch 5a applied (Sun JDK 1.6, Solaris 10).
5a looks good to me. I agree that it's better to keep this code simple and that 5a is preferable to 4a.
Thank you for looking at the new patches Knut Anders.
Committed 'derby-3596-5a-complex_skip_creds.diff' to trunk with revision 666088. This must be backported to 10.4, but I want to wait for a while to see if any issues are detected. I think this patch concludes the work on the issue unless new problems surface. Note that I discovered a performance issue that I will handle under (Uploaded new 5a patch file where tabs on a single line are removed.)
Merged fix 5a to 10.4 with revision 672719.
Regression tests ran without failures on the 10.4 branch. Updated fix version and also ticked the performance category.
Before this fix, resources would leak and garbage collection would cause severe (periodic) performance degradation. Closing, no issues reported.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||