Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.4.4, 3.5.0
-
None
-
Reviewed
Description
To reproduce, try creating a node with 0.5M of data using java client. The test will hang waiting for a response from the server. See the attached patch for the test that reproduces the issue.
It seems that ZOOKEEPER-1437 introduced a few issues to ClientCnxnSocketNIO.doIO that prevent ClientCnxnSocketNIO from sending large packets that require several invocations of SocketChannel.write to complete. The first issue is that the call to outgoingQueue.removeFirstOccurrence(p); removes the packet from the queue even if the packet wasn't completely sent yet. It looks to me that this call should be moved under if (!pbb.hasRemaining()) The second issue is that p.createBB() is reinitializing ByteBuffer on every iteration, which confuses SocketChannel.write. And the third issue is caused by extra calls to cnxn.getXid() that increment xid on every iteration and confuse the server.
Attachments
Attachments
- ZOOKEEPER-1560.patch
- 0.9 kB
- Igor Motov
- zookeeper-1560-v1.txt
- 2 kB
- Ted Yu
- zookeeper-1560-v2.txt
- 3 kB
- Ted Yu
- zookeeper-1560-v3.txt
- 3 kB
- Ted Yu
- zookeeper-1560-v4.txt
- 4 kB
- Ted Yu
- zookeeper-1560-v5.txt
- 4 kB
- Ted Yu
- zookeeper-1560-v6.txt
- 5 kB
- Ted Yu
- zookeeper-1560-v7.txt
- 5 kB
- Ted Yu
- ZOOKEEPER-1560-v8.patch
- 6 kB
- Skye Wanderman-Milne
- ZOOKEEPER-1560-v8_r3.patch
- 7 kB
- Skye Wanderman-Milne
- ZOOKEEPER-1560-v8_r4.patch
- 8 kB
- Skye Wanderman-Milne
Issue Links
- relates to
-
ZOOKEEPER-1561 Zookeeper client may hang on a server restart
- Resolved
Activity
In doIO(), should we check the return value from:
sock.write(pbb);
Here is jstack where testLargeNodeData hung:
"main" prio=5 tid=7f9bed000800 nid=0x10c382000 in Object.wait() [10c380000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <7dc1a15c0> (a org.apache.zookeeper.ClientCnxn$Packet) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) - locked <7dc1a15c0> (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:781) at org.apache.zookeeper.test.ClientTest.testLargeNodeData(ClientTest.java:531)
I think we can send data in chunks if pbb.remaining() is beyond certain threshold.
Patch v1 implements the chunk'ed write.
ClientTest passes.
Please provide your comments.
junit.run:
[junit] Running org.apache.zookeeper.test.ClientTest
[junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 17.694 sec
Hi Ted, I think originalLimit can be final, just as CHUNK_SIZE is. Also should CHUNK_SIZE be a configurable setting, defaulting to 2000?
should CHUNK_SIZE be a configurable setting
If I knew how to access Configuration instance from within ClientCnxnSocketNIO
originalLimit can be final
Sure. Will incorporate into next patch after I get more review comments.
Hi Ted,
For making the chunk size configurable, I think something like this would work: (add to same file as you are patching (ClientCnxnSocketNIO.java))
public static final String CLIENT_WRITE_CHUNK_SIZE = "zookeeper.client.nio.writechunk"; ... final int CHUNK_SIZE = getProperty(CLIENT_WRITE_CHUNK_SIZE,2000);
Thanks Eugene for the hint.
Here is patch v2 which passed ClientTest as well.
Hi Ted,
why not test for the return value of sock.write ?
perhaps if it couldn't write the entire chunk size we can raise an exception.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548823/zookeeper-1560-v1.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1212//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1212//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1212//console
This message is automatically generated.
@Ted, your patch looks good as far as configurability/final-ness.
maybe IOException since this is what sock.write may throw anyway.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548828/zookeeper-1560-v2.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1213//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1213//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1213//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548839/zookeeper-1560-v3.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1214//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1214//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1214//console
This message is automatically generated.
I couldn't reproduce the error in test.ClientTest
Looking at
http://docs.oracle.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#write%28java.nio.ByteBuffer%29
Unless otherwise specified, a write operation will return only after writing all of the r requested bytes.
I think the additional check in patch v3 may not be needed
Hi Ted,
instead of int expectedSize = pbb.limit() - pbb.position();
you can use pbb.remaining().
BTW, the JIRA description mentions several problems with the same block of code.
The first one of them (dequeuing before we know sending succeeded) also causes ZK-1561.
Please correct me if I'm wrong but seems like the current patch doesn't address the 3 problems
mentioned in the description of this JIRA.
Thanks,
Alex
I think this would work for both 1560 and 1561.
if (p != null) { updateLastSend(); if ((p.requestHeader != null) && (p.requestHeader.getType() != OpCode.ping) && (p.requestHeader.getType() != OpCode.auth)) { p.requestHeader.setXid(cnxn.getXid()); } p.createBB(); ByteBuffer pbb = p.bb; ---> while (pbb.hasRemaining()) sock.write(pbb); ---> outgoingQueue.removeFirstOccurrence(p); sentCount++; if (p.requestHeader != null && p.requestHeader.getType() != OpCode.ping && p.requestHeader.getType() != OpCode.auth) { pending.add(p); } }
Patch v4 adopts what Igor and Jacky suggested.
Since Packet p is guaranteed to send all bytes in its ByteBuffer, I think problem 2 no longer exists.
ChrootClientTest and ClientTest both passed.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548889/zookeeper-1560-v4.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1215//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1215//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1215//console
This message is automatically generated.
2012-10-12 14:10:50,042 [myid:] - WARN [main-SendThread(localhost:11221):ClientCnxn$SendThread@1089] - Session 0x13a555031cf0000 for server localhost/127.0.0.1:11221, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Couldn't write 2000 bytes, 1152 bytes written at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:142) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:370) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) 2012-10-12 14:10:50,044 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@349] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x13a555031cf0000, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:662)
Patch v5 adds more information to exception message.
Patch v6 changes the condition for raising IOE: if there is no progress between successive sock.write() calls.
I guess socket's output buffer might be a limiting factor as to the number of bytes written in a particular sock.write() call.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548893/zookeeper-1560-v5.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1216//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1216//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1216//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548898/zookeeper-1560-v6.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1217//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1217//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1217//console
This message is automatically generated.
It seems like in a particular iteration, 0 bytes is written:
localhost/127.0.0.1:11222, unexpected error, closing socket connection and attempting reconnect [exec] [junit] java.io.IOException: Couldn't write 2000 bytes, 0 bytes written in this iteration and 77152 bytes written in total. Original limit: 500074 [exec] [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:145) [exec] [junit] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:375) [exec] [junit] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) [exec] [junit] 2012-10-12 15:20:42,629 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11222:NIOServerCnxn@349] - caught end of stream exception [exec] [junit] EndOfStreamException: Unable to read additional data from client sessionid 0x13a55902b650001, likely client has closed socket [exec] [junit] at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) [exec] [junit] at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) [exec] [junit] at java.lang.Thread.run(Thread.java:662) [exec] [junit] 2012-10-12 15:20:42,630 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11222:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:57126 which had sessionid 0x13a55902b650001
Seems like there's a strange resemblance among all the test failures thus far: always fails after 77152 bytes written.
Patch v7 changes the IOE to a warning.
Let's see if the test is able to make further progress.
I wonder whether 77152 bytes would be big enough for most use cases.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12548908/zookeeper-1560-v7.txt
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1218//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1218//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1218//console
This message is automatically generated.
Good news was that patch v7 passed.
Not so good news was that I didn't find any occurrence of the warning message I added in v7.
Essentially patch v7 is the same as patch v2 - we shouldn't bail if a single sock.write() call didn't make progress.
A couple questions/concerns:
- Is it safe to write out the whole packet in a while loop? Doing so essentially causes the thread to block until everything is written. This seems pretty dangerous to me since there's only a single SendThread – I think it would be safer to take the current approach of only doing one socket write per invocation of doIO. We would still need to address issue 1 in the description (packets are being removed from the queue even if they weren't fully written).
- What is the purpose of chunking the data? Why not write as much as possible?
- This was commented on earlier, but what's going on with issues 2 and 3 from the description? Addressing only issue 1 seems to fix the problem of sending large packets.
Thanks for the review.
I agree chunking is not needed.
it would be safer to take the current approach of only doing one socket write per invocation of doIO
If we leave Packet p in the outgoingQueue if there is data pending to be written and return from doIO, would the next invocation of doIO continue writing the remaining data from the same Packet ?
If so, what is the benefit of one socket write per invocation ?
Should we remove the Packet p and enqueue it to the tail of outgoingQueue if there is data pending ?
The benefit of one socket write per invocation is that it frees up the SendThread to perform other actions in between writes. For instance, it looks like the SendThread is also response for reading incoming data in doIO. I'm not sure what the consequences of having the thread block until the whole packet is written are; it might be ok, but if we can avoid looping it might save us some trouble later.
The next invocation of doIO should continue writing any remaining data from a previous packet (we don't want to mix up data from different packets). However, I'm not sure the current code will do this correctly – I think it's possible for another packet to be pushed onto the queue while another packet is in the middle of being written. I'm going to look into this further.
For problem #3, I only found one call to getXid() in doIO:
p.requestHeader.setXid(cnxn.getXid());
which is not in a loop. Some clarification would be nice.
That's a good point, the while loop in the patch seems like it would block when the tcp buffer is full (e.g. if the server is slow to read). I don't think that's a good idea. Rather we should have the code structured similar to what it was before - write as much as possible and then use the selector to wait for the socket to become writeable again. Eventually the send buffer will drain and we can remove it from the queue.
If no one can prepend outgoingQueue with packet, straightforward implementation like this should work:
diff --git a/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java b/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java index 70d8538..457c8cc 100644 --- a/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java +++ b/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java @@ -111,17 +111,20 @@ public class ClientCnxnSocketNIO extends ClientCnxnSocket { cnxn.sendThread.clientTunneledAuthenticationInProgress()); if (p != null) { - outgoingQueue.removeFirstOccurrence(p); updateLastSend(); - if ((p.requestHeader != null) && - (p.requestHeader.getType() != OpCode.ping) && - (p.requestHeader.getType() != OpCode.auth)) { - p.requestHeader.setXid(cnxn.getXid()); + if (p.bb != null) { + if ((p.requestHeader != null) && + (p.requestHeader.getType() != OpCode.ping) && + (p.requestHeader.getType() != OpCode.auth)) { + p.requestHeader.setXid(cnxn.getXid()); + } + p.createBB(); + // otherwise we're in the middle of sending packet } - p.createBB(); ByteBuffer pbb = p.bb; sock.write(pbb); if (!pbb.hasRemaining()) { + outgoingQueue.removeFirstOccurrence(p); sentCount++; if (p.requestHeader != null && p.requestHeader.getType() != OpCode.ping
Looking at createBB(), upon exit the field bb wouldn't be null.
I wonder why p.createBB() is enclosed in the if (p.bb != null) block above ?
similar to what it was before - write as much as possible and then use the selector to wait for the socket to become writeable again
I looked at svn log for ClientCnxnSocketNIO.java back to 2011-04-12 and didn't seem to find the above change.
FYI
ClientTest normally takes 36 seconds.
I ran ClientTest, twice, with Nikita's change:
[junit] Running org.apache.zookeeper.test.ClientTest [junit] Tests run: 12, Failures: 12, Errors: 0, Time elapsed: 340.748 sec [junit] Test org.apache.zookeeper.test.ClientTest FAILED
I then tried the following:
Index: src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java =================================================================== --- src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java (revision 1401900) +++ src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java (working copy) @@ -111,17 +111,20 @@ cnxn.sendThread.clientTunneledAuthenticationInProgress()); if (p != null) { - outgoingQueue.removeFirstOccurrence(p); updateLastSend(); if ((p.requestHeader != null) && - (p.requestHeader.getType() != OpCode.ping) && - (p.requestHeader.getType() != OpCode.auth)) { - p.requestHeader.setXid(cnxn.getXid()); + (p.requestHeader.getType() != OpCode.ping) && + (p.requestHeader.getType() != OpCode.auth)) { + p.requestHeader.setXid(cnxn.getXid()); } - p.createBB(); + if (p.bb == null) { + p.createBB(); + // otherwise we're in the middle of sending packet + } ByteBuffer pbb = p.bb; sock.write(pbb); if (!pbb.hasRemaining()) { + outgoingQueue.removeFirstOccurrence(p); sentCount++; if (p.requestHeader != null && p.requestHeader.getType() != OpCode.ping
testLargeNodeData encountered connection loss.
Should we remove the Packet p and enqueue it to the tail of outgoingQueue if there is data pending ?
The answer to my own question is no.
This would very likely result in connection loss / timeout when node data size is large.
PDH: similar to what it was before - write as much as possible and then use the selector to wait for the socket to become writeable again
Ted: I looked at svn log for ClientCnxnSocketNIO.java back to 2011-04-12 and didn't seem to find the above change. FYI
Hi Ted, the following is what I was referring to. This is from latest on branch-3.3, 3.4.4 has a similar (although broken) block where it's a bit less obvious what's happening. branch-3.3 is more clear.
Notice that we first attempt to write, if !remaining them we remove from the queue, otw we'll wait till the next time the selector wakes us up (the final isempty check is pretty criticial here as well to set interest correctly) and retry until the buffer is drained.
if (sockKey.isWritable()) { synchronized (outgoingQueue) { if (!outgoingQueue.isEmpty()) { ByteBuffer pbb = outgoingQueue.getFirst().bb; sock.write(pbb); if (!pbb.hasRemaining()) { sentCount++; Packet p = outgoingQueue.removeFirst(); if (p.header != null && p.header.getType() != OpCode.ping && p.header.getType() != OpCode.auth) { pendingQueue.add(p); } } } } } if (outgoingQueue.isEmpty()) { disableWrite(); } else { enableWrite(); }
I got the following based on the above code snippet:
Index: src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java =================================================================== --- src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java (revision 1401904) +++ src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java (working copy) @@ -111,18 +111,18 @@ cnxn.sendThread.clientTunneledAuthenticationInProgress()); if (p != null) { - outgoingQueue.removeFirstOccurrence(p); updateLastSend(); if ((p.requestHeader != null) && (p.requestHeader.getType() != OpCode.ping) && (p.requestHeader.getType() != OpCode.auth)) { p.requestHeader.setXid(cnxn.getXid()); } - p.createBB(); + if (p.bb == null) p.createBB(); ByteBuffer pbb = p.bb; sock.write(pbb); if (!pbb.hasRemaining()) { sentCount++; + outgoingQueue.removeFirstOccurrence(p); if (p.requestHeader != null && p.requestHeader.getType() != OpCode.ping && p.requestHeader.getType() != OpCode.auth) { @@ -141,8 +141,12 @@ synchronized(pendingQueue) { pendingQueue.addAll(pending); } - } + if (outgoingQueue.isEmpty()) { + disableWrite(); + } else { + enableWrite(); + } } private Packet findSendablePacket(LinkedList<Packet> outgoingQueue,
I still saw testLargeNodeData fail:
Testcase: testLargeNodeData took 0.714 sec Caused an ERROR KeeperErrorCode = ConnectionLoss for /large org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /large at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) at org.apache.zookeeper.test.ClientTest.testLargeNodeData(ClientTest.java:61)
For problem #3, I only found one call to getXid() in doIO:
p.requestHeader.setXid(cnxn.getXid());which is not in a loop. Some clarification would be nice.
It's in the outer loop, so to speak. If the packet is large and is sent in chunks, Xid is incremented for every chunk. Before ZOOKEEPER-1437 it was incremented for every packet.
I've created a new patch (ZOOKEEPER-1560-v8.patch) that incorporates what we have so far (moving removeFirstOccurrence to after the packet is completely written, only calling createBB when a BB doesn't already exist, and only calling setXid when no xid is already set). It also modifies findSendablePacket to always choose the first packet if it is partially written. The only place that a packet is prepended to outgoingQueue is ClientCnxn.primeConnection, which should only happen at the very beginning, so a partially-written packet should remain at the beginning of the queue until it is removed. I also cleaned up some of the code so the changes look more extensive than they really are Posted at https://reviews.apache.org/r/7730. I added comments to mark the important parts (as opposed to the clean up).
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12550725/ZOOKEEPER-1560-v8.patch
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1238//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1238//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1238//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12551039/ZOOKEEPER-1560-v8_r3.patch
against trunk revision 1391526.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1241//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1241//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1241//console
This message is automatically generated.
In patch v8, I don't see special handling for cnxn.getXid()
Hi Ted, I'm not clear, what do you mean by that comment?
It was about 'the third issue is caused by extra calls to cnxn.getXid() that increment xid on every iteration and confuse the server.'
In previous patch, there was check whether Xid is 0 before calling p.requestHeader.setXid().
Rather than checking if the xid is 0, the new patch moves the xid code into the if (p.bb == null) block. We're essentially checking whether we've already started writing the packet, and if we haven't, we call createBB and setXid. Since we always do these two actions under the same condition (i.e., when we see a new packet), there's no reason to have two separate checks.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12551585/ZOOKEEPER-1560-v8_r4.patch
against trunk revision 1404260.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1242//console
This message is automatically generated.
I think Patrick forgot to announce the integration of your prior patch:
r1404260 | phunt | 2012-10-31 10:24:54 -0700 (Wed, 31 Oct 2012) | 1 line ZOOKEEPER-1560 Zookeeper client hangs on creation of large nodes (Skye Wanderman-Milne via phunt)
Things take time.
I was waiting for Skye to update the patch to include the missing test. It's failing to apply now because the most recent patch is now partially committed to trunk. Working on it.
Integrated in ZooKeeper-trunk #1728 (See https://builds.apache.org/job/ZooKeeper-trunk/1728/)
ZOOKEEPER-1560 Zookeeper client hangs on creation of large nodes. Added missing test. (Skye Wanderman-Milne via phunt) (Revision 1404288)
ZOOKEEPER-1560 Zookeeper client hangs on creation of large nodes (Skye Wanderman-Milne via phunt) (Revision 1404260)
Result = SUCCESS
phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1404288
Files :
- /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientTest.java
phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1404260
Files :
- /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java
Test that reproduces that issue.