[THRIFT-1423] Binary transport break up by overlapping of TFrame headers in scribe.AsyncClient - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.5, 0.6, 0.6.1, 0.7
Fix Version/s: 0.13.0
Component/s: Java - Library
Labels:
- async
- client
- java
- nio
Environment:

Hide

Oracle Java JDK 1.6.0_27-b07, Ubuntu Server, Amazon EC2 large.instance
Scribe server on local host using libthrift 0.5, TNonBlockingServer, TFramedTransport
Java client on localhost with libthrift 0.5, scribe.AsyncClient, 1 long-lived connection. Client is multithreaded and has persistent connection to scribe.

Show
Oracle Java JDK 1.6.0_27-b07, Ubuntu Server, Amazon EC2 large.instance Scribe server on local host using libthrift 0.5, TNonBlockingServer, TFramedTransport Java client on localhost with libthrift 0.5, scribe.AsyncClient, 1 long-lived connection. Client is multithreaded and has persistent connection to scribe.

Description

I have got strange behavior of async client in libthrift. Building up described configuration on ec2, i have got a lot of moments when thrift server says that it has broken transport. I thought it is scribe problem. But digging inside of tcpdump i have found one strange thing with thrift async client (scribe.AsyncClient).

There is no doubt that async client support for multi-threaded access. Yep, it is implemented as single threaded fst. But it has concurrent incoming pending calls queue as you know.

So, what i have found and that is the source of connections transports breaking off:

17:10:17.815425 IP (tos 0x0, ttl 64, id 38164, offset 0, flags [DF], proto TCP (6), length 56)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 0x1371), seq 1110:1114, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34122685], length 4
0x0000: 4500 0038 9514 4000 4006 a7a9 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 1499 747b 6961
0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc
0x0030: 0208 abbd 0000 003a
17:10:17.815534 IP (tos 0x0, ttl 64, id 38165, offset 0, flags [DF], proto TCP (6), length 56)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 0x134b), seq 1114:1118, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34122685], length 4
0x0000: 4500 0038 9515 4000 4006 a7a8 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 149d 747b 6961
0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc
0x0030: 0208 abbd 0000 005c
17:10:17.815648 IP (tos 0x0, ttl 64, id 60631, offset 0, flags [DF], proto TCP (6), length 52)
localhost.1463 > localhost.39412: Flags [.], cksum 0xfe28 (incorrect -> 0x1171), seq 337, ack 1118, win 256, options [nop,nop,TS val 34123260 ecr 34123260], length 0
0x0000: 4500 0034 ecd7 4000 4006 4fea 7f00 0001
0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14a1
0x0020: 8010 0100 fe28 0000 0101 080a 0208 adfc
0x0030: 0208 adfc
17:10:17.815658 IP (tos 0x0, ttl 64, id 38166, offset 0, flags [DF], proto TCP (6), length 144)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe84 (incorrect -> 0xe7af), seq 1118:1210, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34123260], length 92
0x0000: 4500 0090 9516 4000 4006 a74f 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 14a1 747b 6961
0x0020: 8018 0101 fe84 0000 0101 080a 0208 adfc
0x0030: 0208 adfc 8001 0001 0000 0003 4c6f 6700
0x0040: 0000 000f 0001 0c00 0000 010b 0001 0000
0x0050: 0005 6974 656d 730b 0002 0000 0030 2863
0x0060: 6c75 622e 7573 652c 2036 3031 322c 2030
0x0070: 2c20 302c 2032 3830 2c20 302c 2074 7275
0x0080: 652c 2031 3332 3035 3838 3631 3729 0000
17:10:17.815971 IP (tos 0x0, ttl 64, id 60632, offset 0, flags [DF], proto TCP (6), length 52)
localhost.1463 > localhost.39412: Flags [R.], cksum 0xfe28 (incorrect -> 0x1111), seq 337, ack 1210, win 256, options [nop,nop,TS val 34123260 ecr 34123260], length 0
0x0000: 4500 0034 ecd8 4000 4006 4fe9 7f00 0001
0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14fd
0x0020: 8014 0100 fe28 0000 0101 080a 0208 adfc
0x0030: 0208 adfc

There are two sequentional tframedtransport headers/packets in the dump. then ack from scribe. then psh of Log_call. it is clearly ok that scribe said it is broken transport caused by tframes are mixed. 1 packet is TFrame data size for message = 3a, 2 packet is TFrame data size = 5c. Third is ACK. Forth is Log_call length of 5c. Fifth packed is ACK with RST from scribe cause it found error in binary stream. There is no Log_call packet of length 3a between two first packets. Two sequentional tframe headers are in separate packets from it's data only because of TCP_NODELAY and separated channel.write(TFrameSizeBuffer) and channel.write(TFrameDataBuffer).

Why there are separated states and buffers to write BODY_SIZE and BODY_DATA in the client (TAsyncMethodCall)? Why there are separated states for READS? The only thing the BODY_SIZE is for, is emulation of TFramedTransport, but why it writes it's header asynchronously and separatelly from BODY_DATA? Sync version of client writes TFrameHeader and it's data in one single TCP packet, not as async client. Any way, this follows transport to mess it's data.

Am i right this is bug of implementation of tframed support in async client?

Attachments

Issue Links

duplicates

THRIFT-4714 Java TFramedTransport calls write twice for each flush

Closed

Binary transport break up by overlapping of TFrame headers in scribe.AsyncClient

Details

Description

Attachments

Issue Links

Activity

People

Dates