Stefan Zoerner added a comment - 26/Apr/06 03:01 PM
Maybe it is a skill problem or something wrong with my environment, but I am not able to extract the attached file. The download is 2.2 MB and I am able to browse the content (e.g. jar tf ...), but if I unzip it (with either jar or zip tools), the archive seems to be empty. Has anybody encountered comparable problems? I used both Firefox and IE to download the file, with the same result.
I successfully extracted the zip. Linux + Archive Manager are my friends...
I'm looking at the bug. May I suggest two things ?
1) Can you check with RC1 to see if the problem still exists? 0.9.3 is pretty old and buggy. If you are patient, RC2 may be out by the end of this week, with fewer bugs.
2) We would greatly appreciate a "how to" that explains how to launch the test, how to configure it, and a way to populate the initial LDAP server. Without that information, it's very difficult to reproduce the problem, as we have to browse the source to understand how it works...
Thanks a lot !

Stefan, Emmanuel,
thanks for looking into this problem. I can still reproduce problems LIKE the one reported with the current trunk. However, the problem I reported was triggered by the DS problem in conjunction with a problem in my code which caused a large number of connections to be used simultaneously. This latter problem has been fixed since. As mentioned, I can still reproduce similar problems with a current version, but to save you the hassle of wading through my code - sorry for that -, I'll try to boil everything down to a simple test-case which demonstrates the problem with a pristine DS setup. Please bear with me for a few more days.
Joerg Henne

Ahha, interesting !
Can you just do a "netstat -a | grep 10389" (or the equivalent under W$), and also produce a thread stack (ctrl-brk, I guess), or use jstack (http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstack.html)? It would be very interesting to see how many sockets are open and also how many threads have been created on the server side and on the client side.
Emmanuel

Sorry for the long delay, unfortunately I was busy with other projects.
In order to investigate this problem a bit further, I wrote a simple test case which can still (RC3!) be used to trigger hangs and errors in various ways. First of all, a few words on the test case:
- In order to run it, you need a running DS (obviously) and an existing path somewhere in the directory which can stand some abuse.
- You can configure all that in the getEnv() method.
- The test case will remove everything below the specified path!

There are three tests in the test case:
- testSingleThreaded() runs 100 cycles of [create 10 OUs; remove them]. For every operation a separate InitialDirContext is used. However, connection pooling is turned on in getEnv(). This test usually runs just fine.
- testSingleThreadedKeepLingeringCtx() exploits a bug I made some time ago: createSubcontext() returns a subcontext which must be closed explicitly. If you don't, the SUN LDAP provider will open a large-ish number of connections (depending on when the garbage collection runs). This test hangs pretty reliably starting at around 10 open connections.
- testMultiThreaded() is yet another variation on the theme, this time creating/deleting stuff in a multi-threaded fashion. Even with proper Context management this hangs very reliably, too.

Bottom line: with respect to the subject of the bug I'd say that my initial assumption that there is a problem with queries was wrong. Once something has triggered the hang, every connection attempt within a few seconds hangs, too. Query or not. The second test is based on clearly broken connection management, but DS should still behave properly. The third test demonstrates the hang with proper (IMHO) connection management, but in a perfectly legal multi-threading situation.
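The context handling that the second test deliberately gets wrong can be sketched as follows. This is only a minimal illustration, not the code from the attached test case: the class name ContextHygiene, the methods pooledEnv() and createOu(), and the provider URL are hypothetical. The pooling switch is the Sun LDAP provider's "com.sun.jndi.ldap.connect.pool" environment property.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

final class ContextHygiene {

    /** Builds a JNDI environment with connection pooling switched on. */
    static Hashtable<String, String> pooledEnv(String providerUrl) {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, providerUrl);
        // Sun LDAP provider property that enables pooled connections.
        env.put("com.sun.jndi.ldap.connect.pool", "true");
        return env;
    }

    /** Creates ou=<name> and closes BOTH the parent context and the returned subcontext. */
    static void createOu(Hashtable<String, String> env, String name) throws NamingException {
        Attributes attrs = new BasicAttributes(true);
        BasicAttribute objectClass = new BasicAttribute("objectClass");
        objectClass.add("top");
        objectClass.add("organizationalUnit");
        attrs.put(objectClass);
        attrs.put("ou", name);

        DirContext ctx = new InitialDirContext(env);
        try {
            DirContext sub = ctx.createSubcontext("ou=" + name, attrs);
            // Forgetting this close() is the mistake described above:
            // the subcontext lingers until the garbage collector finds it.
            sub.close();
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL; only builds the environment, does not contact a server.
        Hashtable<String, String> env = pooledEnv("ldap://localhost:10389/ou=test");
        System.out.println("pooling enabled: " + env.get("com.sun.jndi.ldap.connect.pool"));
    }
}
```

Calling createOu() needs a running server and an existing parent entry; main() only prints the pooling flag so the sketch can be run without one.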
This is what the multi-threading hang looks like:

Client-side threads:
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner at localhost:4923
  System Thread [Finalizer] (Running)
  System Thread [Reference Handler] (Running)
  Thread [main] (Running)
  System Thread [Signal Dispatcher] (Running)
  Thread [ReaderThread] (Running)
  Thread [Thread-0] (Running)
  Thread [Thread-1] (Running)
  Thread [Thread-4] (Running)
  Thread [Thread-5] (Running)
  Thread [Thread-9] (Running)
  Thread [Thread-12] (Running)
  Thread [Thread-13] (Running)
  Thread [Thread-14] (Running)
  Thread [Thread-15] (Running)
  Thread [Thread-16] (Running)
  Thread [Thread-17] (Running)
  Thread [Thread-18] (Running)
  Thread [Thread-19] (Running)

Server-side threads (yes, I'm running DS from JBoss):
org.jboss.Main at localhost:4078
  System Thread [Finalizer] (Running)
  System Thread [Reference Handler] (Running)
  System Thread [Signal Dispatcher] (Running)
  Thread [DestroyJavaVM] (Running)
  Thread [Timer-0] (Running)
  Thread [ScannerThread] (Running)
  Thread [SocketAcceptor-0] (Running)
  Thread [DatagramAcceptor-1] (Running)
  Thread [JBossLifeThread] (Running)
  Thread [LeaderFollowerThreadPool-1] (Running)
  Thread [PooledByteBufferExpirer-0] (Running)
  Thread [SocketAcceptorIoProcessor-0.0] (Running)
  Thread [AnonymousIoService-3-14] (Running)

And here's the netstat output (German Windows: ABHOEREN = LISTENING, HERGESTELLT = ESTABLISHED):
$ netstat -a|grep ldap
TCP nasenbaer:ldap nasenbaer:0 ABHOEREN
TCP nasenbaer:ldap localhost:3273 HERGESTELLT
TCP nasenbaer:ldap localhost:3277 HERGESTELLT
TCP nasenbaer:ldap localhost:3323 HERGESTELLT
TCP nasenbaer:ldap localhost:3327 HERGESTELLT
TCP nasenbaer:ldap localhost:3331 HERGESTELLT
TCP nasenbaer:ldap localhost:3401 HERGESTELLT
TCP nasenbaer:ldap localhost:3405 HERGESTELLT
TCP nasenbaer:ldap localhost:3470 HERGESTELLT
TCP nasenbaer:ldap localhost:3474 HERGESTELLT
TCP nasenbaer:ldap localhost:3478 HERGESTELLT
TCP nasenbaer:ldap localhost:3566 HERGESTELLT
TCP nasenbaer:ldap localhost:3570 HERGESTELLT
TCP nasenbaer:ldap localhost:3574 HERGESTELLT
TCP nasenbaer:ldap localhost:3578 HERGESTELLT
TCP nasenbaer:ldap localhost:3582 HERGESTELLT
TCP nasenbaer:ldap localhost:3878 HERGESTELLT
TCP nasenbaer:ldap localhost:3969 HERGESTELLT
TCP nasenbaer:ldap localhost:3985 HERGESTELLT
TCP nasenbaer:ldap localhost:3988 HERGESTELLT
TCP nasenbaer:ldap localhost:4022 HERGESTELLT
TCP nasenbaer:ldap localhost:4072 HERGESTELLT
TCP nasenbaer:ldap localhost:4076 HERGESTELLT
TCP nasenbaer:ldap localhost:4080 HERGESTELLT
TCP nasenbaer:ldap localhost:4084 HERGESTELLT
TCP nasenbaer:ldap localhost:4088 HERGESTELLT
TCP nasenbaer:ldap localhost:4180 HERGESTELLT
TCP nasenbaer:ldap localhost:4184 HERGESTELLT
TCP nasenbaer:ldap localhost:4188 HERGESTELLT
TCP nasenbaer:ldap localhost:4192 HERGESTELLT
TCP nasenbaer:ldap localhost:4196 HERGESTELLT
TCP nasenbaer:ldap localhost:4317 HERGESTELLT
TCP nasenbaer:ldap localhost:4321 HERGESTELLT
TCP nasenbaer:ldap localhost:4325 HERGESTELLT
TCP nasenbaer:ldap localhost:4329 HERGESTELLT
TCP nasenbaer:ldap localhost:4333 HERGESTELLT
TCP nasenbaer:ldap localhost:4447 HERGESTELLT
TCP nasenbaer:ldap localhost:4455 HERGESTELLT
TCP nasenbaer:ldap localhost:4459 HERGESTELLT
TCP nasenbaer:ldap localhost:4463 HERGESTELLT
TCP nasenbaer:ldap localhost:4467 HERGESTELLT
TCP nasenbaer:ldap localhost:4865 HERGESTELLT
TCP nasenbaer:ldap localhost:4869 HERGESTELLT
TCP nasenbaer:ldap localhost:4873 HERGESTELLT
TCP nasenbaer:ldap localhost:4877 HERGESTELLT
TCP nasenbaer:ldap localhost:4881 HERGESTELLT
TCP nasenbaer:ldap localhost:4899 HERGESTELLT
TCP nasenbaer:ldap localhost:4903 HERGESTELLT
TCP nasenbaer:ldap localhost:4913 HERGESTELLT
TCP nasenbaer:ldap localhost:4914 HERGESTELLT
TCP nasenbaer:ldap localhost:4915 HERGESTELLT
TCP nasenbaer:ldap localhost:4940 HERGESTELLT
TCP nasenbaer:ldap localhost:4944 HERGESTELLT
TCP nasenbaer:ldap localhost:4948 HERGESTELLT
TCP nasenbaer:ldap localhost:4950 HERGESTELLT
TCP nasenbaer:ldap localhost:4952 HERGESTELLT
TCP nasenbaer:ldap localhost:4956 HERGESTELLT
TCP nasenbaer:ldap localhost:4958 HERGESTELLT
TCP nasenbaer:ldap localhost:4961 HERGESTELLT
TCP nasenbaer:ldap localhost:4964 HERGESTELLT
TCP nasenbaer:ldap localhost:4967 HERGESTELLT
TCP nasenbaer:ldap localhost:4970 HERGESTELLT
TCP nasenbaer:3875 localhost:ldap HERGESTELLT
TCP nasenbaer:3878 localhost:ldap HERGESTELLT

I have tested with an embedded ApacheDS server, and it does not hang.
I guess the problem could lie in the network part (MINA ?). I have to check that.

I just investigated this issue a bit further. It is definitely a problem with the network layer, but I could not yet figure out what is going on.
What I have found: A hanging client thread looks like this:

Thread [Thread-10] (Suspended)
  waiting for: com.sun.jndi.ldap.LdapRequest (id=37)
  java.lang.Object.wait(long) line: not available [native method]
  com.sun.jndi.ldap.Connection.readReply(com.sun.jndi.ldap.LdapRequest) line: 418
  com.sun.jndi.ldap.LdapClient.processReply(com.sun.jndi.ldap.LdapRequest, com.sun.jndi.ldap.LdapResult, int) line: 857
  com.sun.jndi.ldap.LdapClient.add(com.sun.jndi.ldap.LdapEntry, javax.naming.ldap.Control[]) line: 1008
  com.sun.jndi.ldap.LdapCtx.c_bind(javax.naming.Name, java.lang.Object, javax.naming.directory.Attributes, com.sun.jndi.toolkit.ctx.Continuation) line: 375
  com.sun.jndi.ldap.LdapCtx(com.sun.jndi.toolkit.ctx.ComponentDirContext).p_bind(javax.naming.Name, java.lang.Object, javax.naming.directory.Attributes, com.sun.jndi.toolkit.ctx.Continuation) line: 277
  com.sun.jndi.ldap.LdapCtx(com.sun.jndi.toolkit.ctx.PartialCompositeDirContext).bind(javax.naming.Name, java.lang.Object, javax.naming.directory.Attributes) line: 197
  com.sun.jndi.ldap.LdapCtx(com.sun.jndi.toolkit.ctx.PartialCompositeDirContext).bind(java.lang.String, java.lang.Object, javax.naming.directory.Attributes) line: 186
  javax.naming.directory.InitialDirContext.bind(java.lang.String, java.lang.Object, javax.naming.directory.Attributes) line: 158
  com.levigo.tcat.test.directory.TestHang.create(java.lang.String) line: 77
  com.levigo.tcat.test.directory.TestHang.createAndDelete() line: 93
  com.levigo.tcat.test.directory.TestHang.access$0(com.levigo.tcat.test.directory.TestHang) line: 83
  com.levigo.tcat.test.directory.TestHang$MyRunner.run() line: 41

There is a corresponding com.sun.jndi.ldap.Connection worker thread that goes along with it - in this case the following one:

Thread [Thread-17] (Suspended)
  owns: java.io.BufferedInputStream (id=36)
  java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) line: not available [native method]
  java.net.SocketInputStream.read(byte[], int, int) line: 129
  java.io.BufferedInputStream.fill() line: 218
  java.io.BufferedInputStream.read1(byte[], int, int) line: 256
  java.io.BufferedInputStream.read(byte[], int, int) line: 313
  com.sun.jndi.ldap.Connection.run() line: 780 [local variables unavailable]
  java.lang.Thread.run() line: 595

This worker thread looks normal - it is only used for reading, never for writing. When a hang occurs, it is due to a message gone missing. The message has been sent by com.sun.jndi.ldap.LdapClient, prior to calling Connection.readReply(). I added some debugging output directly in org.apache.mina.transport.socket.nio.process(Set). Thus I can see the exact port numbers for which data becomes available. What I learned is that data never becomes available for the corresponding socket (except for the initial bind call). After some more debugging I found that the channel has, for some reason, gone missing from the SocketIoProcessor's Selector. Well, and now the big question is, of course: why does the channel vanish? Maybe this rings a bell with one of you?

Can you tell us which JVM and version you are using?
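Whether a channel is really gone from a Selector, or is still registered but simply never reported as ready, can be checked with plain java.nio calls. The following is a minimal sketch, not MINA code: the class SelectorInspect and its methods are hypothetical, and a java.nio.channels.Pipe stands in for the LDAP socket so that it runs without a server.

```java
import java.io.IOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class SelectorInspect {

    /** Lists every key registered with the selector, whether selected or not. */
    static String describe(Selector selector) {
        StringBuilder sb = new StringBuilder();
        for (SelectionKey key : selector.keys()) {
            sb.append(key.channel().getClass().getSimpleName());
            sb.append(" valid=").append(key.isValid());
            if (key.isValid()) {
                // interestOps() throws on a cancelled key, hence the guard.
                sb.append(" interestOps=").append(key.interestOps());
                sb.append(" selected=").append(selector.selectedKeys().contains(key));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Registers one idle channel; returns {registered keys, ready keys}. */
    static int[] demo() throws IOException {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        try {
            pipe.source().configureBlocking(false);
            pipe.source().register(selector, SelectionKey.OP_READ);
            int ready = selector.selectNow(); // nothing was written, so nothing is ready
            System.out.print(describe(selector));
            return new int[] { selector.keys().size(), ready };
        } finally {
            selector.close();
            pipe.sink().close();
            pipe.source().close();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] counts = demo();
        System.out.println("registered=" + counts[0] + " ready=" + counts[1]);
    }
}
```

A channel that shows up in keys() with a valid key but never in selectedKeys() is still registered; the selector just never sees data for it.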
We have already experienced weird things with some versions of the JRockit JVM... Just in case. Something that could help would be to put an LDAP proxy in the middle. Either you use ethereal, but it's not very easy to use under Windows :), or you can use the LdapProxy which is in the sandbox : http://svn.apache.org/viewvc/directory/sandbox/trunk/proxy/
Doco is here : http://wiki.apache.org/directory/ProxyHome?highlight=%28proxy%29

I am using 1.5.0_03. Good 'ol SUN, of course.
Ethereal? LdapProxy? Well, good idea, but stumbling blocks everywhere:
- Ethereal can only capture traffic crossing a non-loopback interface. And "crossing" means actual electrons being shoved around, in contrast to data being routed internally. So I tried to run the test from a secondary system. Unfortunately this seems to skew the timing so much that it just doesn't hang anymore. Even with more threads and everything.
- Next try: LdapProxy. After some pushing and shoving (want an M2 pom.xml for it?), I finally got the dependencies right and the thing to compile. Result: same problem. LdapProxy is really, really slow and thus reliably prevents hangs.
- Yet another idea: launch my Linux VMWare VM and try it from there. Still no hang. *sigh*

Bottom line: *sigh* Does anybody have some more ideas? I'm kind of relieved to see that the hangs don't seem to occur remotely. This makes the problem less urgent, but a bad feeling persists. And for crying out loud: I've dumped almost a day into this darned problem.

This problem is going wild.
Ok. Let's summarize :
- You are using SUN JDK 1.5.0_03
- You are on Windows
- The hang only happens when you use ADS locally, and if you don't slow it down with artifacts like LdapProxy

From the last point, what I suspect is a synchronization problem. Let's try some more moves :
1) Launch your test, and when it's hanging, try to connect to ADS with LdapBrowser or JXplorer. If it does not respond, ok, this is ADS.
2) Give a try to the latest version of the SUN JVM : 1.5.0_07. There are a bunch of fixes since 1.5.0_03 (http://java.sun.com/j2se/1.5.0/ReleaseNotes.html)
3) Give a try to the IBM 1.5 JVM, or the latest JRockit JVM. They are both really fast.
4) A thread dump could help : client and server. On Windows, I think that you can use a tool like http://www.latenighthacking.com/projects/2003/sendSignal/ to generate stack traces, then you have other tools to analyze them. With the IBM JVM, I think it's easier. On Linux, kill -3 is enough.
5) Kill a chicken, drink its blood and throw some salt on the ground, while singing under the moon. Ok, it may not help, but who knows ?

Sorry that you spoiled a day to reach this conclusion : "I've dumped almost a day into this darned problem.". The positive point is that if we finally find the reason *and* the fix, then other users will benefit from your work. What else can I say? At 1AM, nothing more, I think :(

Thanks for your continued feedback, Emmanuel!
I'll answer your points one-by-one:
1) Even other threads within my test case can continue their work undisturbed. Connections from other sources are also not a problem at all. The symptom is simply that some connections seem to just go dead. To give you an idea of how many are affected: I usually run the test with 10 threads, each executing about 200 interactions with the server (100 object creations, 100 deletions). Of those 10 threads usually about 1-3 run into the hang. As stated earlier: when a connection runs into the hung state, this causes the corresponding channel to not be returned from Selector.select() calls. My earlier observation, that the channel is completely lost from the selector's channel list, was bunk, btw. It is still there, but simply not selected. This may very well be a problem with the runtime libraries or even the LDAP client, BTW.
2) Good idea, but still: hangs as before.
3) JRockit: hey, I've wanted to try this for a long time. Now it's time to do so.
   Test 1: Server on JRockit, client (unit test) on SUN: still hangs.
   Test 2: Server on JRockit, client on JRockit: no hang. What was that? Several tries: IT! DOESN'T! HANG! wow.
   Test 3: Server on SUN, client on JRockit: still no hang. Interesting.
   IBM JVM:
   Test 1: server on SUN, client on IBM: hang!
   Test 2: server on IBM, client on IBM: hang!
   Observation on the side: the test runs 3-4 times slower on IBM and SUN JVMs (even though some threads don't even make it to the end due to a hang!) compared to JRockit. The effect on the server side seems to be far less pronounced, which might be due to the log output on the client side. While we're at it, some completely unscientific benchmarks. The client is always on JRockit (since it is the only way the client always makes it to the end, it doesn't make sense to compare using other JVMs for the client) and run multiple times to allow for some JIT burn-in:
   - SUN 1.5.0_07: ~5000ms per test run.
   - IBM 1.5: ~5700ms
   - JRockit: ~4000ms (OMG!)
4) The thread dump is not a problem. I have both client and server running under full debugger control and can plainly see what all the threads are doing. A TCP capture would be very, very interesting, but I don't know how I can capture traffic which doesn't actually cross a physical network interface.
5) Unfortunately, I don't have any chickens at hand (lucky them!), but to draw some conclusion: one possible explanation would be that the problems are caused by the different IO libraries used by the different JVMs (see thread dumps below). The cause could also be a problem on the server side which is triggered by certain timing differences between the client JVMs. However, the former seems more likely to me, because the hangs don't seem to be influenced by the client timing itself. In fact, I first got the hangs using my OLM (object to LDAP mapping) framework which surely has very, very different timing characteristics.

Here's a stack dump of a JRockit reader thread:

Thread [Thread-4] (Suspended)
  owns: java.io.BufferedInputStream (id=51)
  jrockit.net.SocketNativeIO.readBytesPinned(int, byte[], int, int, int) line: not available [native method]
  jrockit.net.SocketNativeIO.socketRead(java.io.FileDescriptor, byte[], int, int, int) line: not available
  java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) line: not available
  java.net.SocketInputStream.read(byte[], int, int) line: 129
  java.io.BufferedInputStream.fill() line: 218
  java.io.BufferedInputStream.read1(byte[], int, int) line: 256
  java.io.BufferedInputStream.read(byte[], int, int) line: 313
  com.sun.jndi.ldap.Connection.run() line: 784
  java.lang.Thread.run() line: not available

This is from the IBM JVM:

Thread [Thread-17] (Suspended)
  owns: java.io.BufferedInputStream (id=45)
  java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) line: not available [native method]
  java.net.SocketInputStream.read(byte[], int, int) line: 155
  java.io.BufferedInputStream.fill() line: 229
  java.io.BufferedInputStream.read1(byte[], int, int) line: 267
  java.io.BufferedInputStream.read(byte[], int, int) line: 324
  com.sun.jndi.ldap.Connection.run() line: 814
  java.lang.Thread.run() line: 788

And this is, finally, the SUN JVM:

Thread [Thread-31] (Suspended)
  owns: java.io.BufferedInputStream (id=60)
  java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) line: not available [native method]
  java.net.SocketInputStream.read(byte[], int, int) line: 129
  java.io.BufferedInputStream.fill() line: 218
  java.io.BufferedInputStream.read1(byte[], int, int) line: 256
  java.io.BufferedInputStream.read(byte[], int, int) line: 313
  com.sun.jndi.ldap.Connection.run() line: 784
  java.lang.Thread.run() line: 595

I'm not saying that this specific class is the culprit - it is rather the write-side of the communication which is the problem, but the stack dumps indicate that JRockit has very different socket-IO code compared to SUN/IBM. Wild guess: IBM licensed

Looks like E has some history on this issue. I've been too late with it. Hope you don't mind, Emmanuel, that I transfer this issue to you.
Thanks. I loaded the attached bugreport project in my Eclipse workbench and ran the tests including TestHang.java. The tests don't hang but they print out some errors:
INFO 2006-08-31 18:49:56,533 [ldap.Mapping; main]: LDAP mapping initialized
DEBUG 2006-08-31 18:49:56,536 [ldap.Mapping; main]: load(): type=class com.levigo.tcat.common.model.OrganizationalUnit, ctx=javax.naming.directory.InitialDirContext@2c1e6b, dn=ou=appgroups
DEBUG 2006-08-31 18:49:56,537 [ldap.TypeMapping; main]: loading object of class com.levigo.tcat.common.model.OrganizationalUnit for dn=ou=appgroups
DEBUG 2006-08-31 18:49:56,537 [ldap.Mapping; main]: create(): create=class com.levigo.tcat.common.model.OrganizationalUnit
DEBUG 2006-08-31 18:49:56,538 [ldap.Mapping; main]: save(): object=com.levigo.tcat.common.model.OrganizationalUnit@811c88, ctx=javax.naming.directory.InitialDirContext@2c1e6b, baseDN=
DEBUG 2006-08-31 18:49:56,538 [ldap.AttributeMapping; main]: dehydrating object of type class com.levigo.tcat.common.model.OrganizationalUnit
DEBUG 2006-08-31 18:49:56,539 [ldap.AttributeMapping; main]: dehydrating object of type class com.levigo.tcat.common.model.OrganizationalUnit
DEBUG 2006-08-31 18:49:56,543 [ldap.Mapping; main]: list(): type=class com.levigo.tcat.common.model.Client, ctx=null, filter=null, searchBase=null
DEBUG 2006-08-31 18:49:56,546 [ldap.TypeMapping; main]: listing objects of class com.levigo.tcat.common.model.Client for base=ou=clients, filter=null
com.levigo.tcat.common.directory.DirectoryException: Can't save object
  at com.levigo.tcat.common.directory.ldap.TypeMapping.save(TypeMapping.java:354)
  at com.levigo.tcat.common.directory.ldap.Mapping.save(Mapping.java:173)
  at com.levigo.tcat.common.directory.ldap.LDAPDirectory.setupBaseStructure(LDAPDirectory.java:116)
  at com.levigo.tcat.common.directory.ldap.LDAPDirectory.<init>(LDAPDirectory.java:88)
  at LDAPDirectoryTest.getDirectory(LDAPDirectoryTest.java:34)
  at LDAPDirectoryTest.setUp(LDAPDirectoryTest.java:42)
  at junit.framework.TestCase.runBare(TestCase.java:125)
  at junit.framework.TestResult$1.protect(TestResult.java:106)
  at junit.framework.TestResult.runProtected(TestResult.java:124)
  at junit.framework.TestResult.run(TestResult.java:109)
  at junit.framework.TestCase.run(TestCase.java:118)
  at junit.framework.TestSuite.runTest(TestSuite.java:208)
  at junit.framework.TestSuite.run(TestSuite.java:203)
  at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
  at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: com.levigo.tcat.common.directory.DirectoryException: Can't marshal instance of class com.levigo.tcat.common.model.OrganizationalUnit
  at com.levigo.tcat.common.directory.ldap.TypeMapping.saveNewObject(TypeMapping.java:393)
  at com.levigo.tcat.common.directory.ldap.TypeMapping.save(TypeMapping.java:352)
  ... 18 more
Caused by: javax.naming.OperationNotSupportedException: [LDAP: error code 53 - no global superior knowledge]; remaining name 'ou=appgroups'
  at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3058)
  at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2931)
  at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2737)
  at com.sun.jndi.ldap.LdapCtx.c_createSubcontext(LdapCtx.java:770)
  at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_createSubcontext(ComponentDirContext.java:319)
  at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.createSubcontext(PartialCompositeDirContext.java:248)
  at javax.naming.directory.InitialDirContext.createSubcontext(InitialDirContext.java:183)
  at com.levigo.tcat.common.directory.ldap.TypeMapping.saveNewObject(TypeMapping.java:374)
  ... 19 more

Any clue?

Ah I didn't launch the LDAP server ;)
I guess MINA 1.0-SNAPSHOT might have resolved this problem. Please try to upgrade, and let me know the result.
> I guess MINA 1.0-SNAPSHOT might have resolved this problem. Please try to upgrade, and let me know the result.
Unfortunately, the problem still persists with the latest DS 1.0-trunks which references MINA 1.0-SNAPSHOT. I'll try to dig deeper... After a thorough debugging session I've come to the conclusion that this is, in fact, not a problem of either MINA or DS, but a problem generated by Windows XP's application level gateway which is part of the Windows internet firewall. Sorry for accusing non-culprits for this mess... :-/
Just in case anybody cares, I'll give a quick roundup of what I found: I started by generating traces of calls and data flow both on the client and the server side by adding appropriate debugging code to MINA's SocketIoProcessor and SUN's LDAP Connection object (the latter by downloading and modifying the sources). I generated separate traces per connection, i.e. text files named after the local port number of the client.

Early on I noticed that the port numbers on the client and the server didn't match, because the Windows internet firewall proxies those calls through the application level gateway (i.e. there are in fact two connections, one from the client to the gateway and one from the gateway to the server - all of which can be seen using netstat or Sysinternals' TCPView). I wasn't terribly worried about this, because things should work even with the gateway in place.

One interesting thing I noticed is that under high networking loads, i.e. about 20 active and open connections, the application level gateway seems to "lose it", which is indicated by new connections being made directly, bypassing the application level gateway. In other words: for some new connections the port numbers did suddenly match up. Note to the guys with the black hats: you may want to try to bypass the application level gateway by inundating it with connections for a brief period.

Anyway, back to the problem: once those "direct" connections start to occur, some other, previously existing connections seem to go dead: the client sends something, but the server never receives anything, causing the client to time out. The weird thing about the application level gateway is that it is not only used for connections crossing a protected gateway, but for all connections, even local loopbacks.
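The port-number mismatch described above is easy to check in isolation. The following is a minimal sketch, assuming nothing beyond java.net; the class PortMatch and the method portsMatch() are hypothetical and not part of the traces mentioned here. It opens a loopback connection and compares the local port the client bound with the peer port the server sees. With a direct connection the two are equal; a proxy such as the application level gateway would make them differ.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class PortMatch {

    /** Returns true when the server sees the same port the client bound locally. */
    static boolean portsMatch() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port for the listening socket.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            System.out.println("client local port: " + client.getLocalPort());
            System.out.println("server sees peer : " + accepted.getPort());
            return client.getLocalPort() == accepted.getPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(portsMatch()
                ? "direct connection"
                : "something sits between client and server");
    }
}
```

Logging both numbers per connection, as in the traces above, shows at a glance which connections are proxied and which are direct.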
In other words: if you have even one interface with an active firewall in your system (which I do, for the wireless interface), even if this interface is down, all TCP connections go through the application level gateway. Well, and of course the punchline of all that is: once you completely turn off the Windows internet firewall by shutting down the respective service, everything works fine and rock-solid again. *sigh*

Yuk !!! Man, just quit using Windows... It's a kind of ramping cancer ;)
Ok, that's great news ! I'm very happy that neither MINA nor ADS is the culprit for your problems. I'm going to close the issue. Btw, we have released ADS 1.0 this week, and we successfully obtained the Open Group certification.

It's not an ADS nor a MINA bug. Windows Firewall is to be blamed for it ...
[[ Old comment, sent by email on Mon, 11 Sep 2006 12:26:05 +0200 ]]
Trustin,
Unfortunately, the first test was way too complicated, because I extracted it from my application code instead of trying to just cause the hang. The attached TestHang.java causes the same hang with a lot less code. I'll look into 1.0-SNAPSHOT asap.
Joerg Henne

[[ Old comment, sent by email on Thu, 12 Oct 2006 16:53:04 +0200 ]]
well, I'd do it, but tell that to my customers...

Yes, that's really great news. A big thanks and congratulations for that achievement. I've been bogged down with other work and therefore unable to work on ADS, but tackling that stupid hang-problem was a start again.
Joerg Henne