[PROTON-1727] [epoll proactor] segfaults, hangs and leaked FDs around failed connect - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: proton-c-0.18.1
Fix Version/s: proton-c-0.20.0
Component/s: proton-c
Labels: None

Description

There is a race condition that causes leaked FDs and segfaults in the epoll proactor under the following conditions:

there is more than one thread processing proactor events.
the connection attempt targets a host address that resolves to multiple socket addresses, e.g. the NULL hostname on a machine with both IPv4 and IPv6 enabled.
there is nothing listening on the target port.
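
For illustration only, the connect pattern behind these conditions can be sketched in plain blocking C. This is not the epoll proactor's code: connect_any is a hypothetical helper, and the proactor performs the same steps asynchronously across threads. The sketch only shows that a NULL hostname can resolve to several socket addresses, and that the socket created for each failed attempt has to be closed before the next address is tried.

```c
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of proton-c.
 * Tries each socket address that host:port resolves to, in order.
 * Returns a connected fd, or -1 if every attempt failed.
 * *tried receives the number of addresses for which a socket was created. */
static int connect_any(const char *host, const char *port, int *tried)
{
    struct addrinfo hints, *res = NULL, *ai;
    int fd = -1;

    assert(tried != NULL);
    *tried = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* allow both IPv4 and IPv6 results */
    hints.ai_socktype = SOCK_STREAM;

    /* A NULL host without AI_PASSIVE resolves to the loopback address(es). */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;                 /* address family unavailable, try next */
        ++*tried;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                    /* connected: hand fd to the caller */
        close(fd);                    /* failed attempt: release the fd now */
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}
```

Calling connect_any(NULL, "5672", &tried) with nothing listening on that port would be expected to return -1 with every intermediate socket closed. The port number is only an example.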

The attached reproducer shows several bad behaviors:

under rr or valgrind (--tool=memcheck and --tool=helgrind) it quickly (< 1min) shows race conditions and/or invalid memory access.
it hangs fairly often even without valgrind/rr, more so if you increase the thread count. Without valgrind/rr it rarely segfaults.
it leaks FDs - the test should run forever, but runs out of FDs after around 1024 iterations.
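
One way to confirm the FD leak independently of the reproducer is to count the process's open descriptors between iterations. The sketch below is an assumption-laden illustration, not part of the attached test: count_open_fds is a hypothetical helper that probes each descriptor number up to the soft RLIMIT_NOFILE limit. That limit is commonly 1024 on Linux, which would be consistent with the failure after about 1024 iterations if roughly one descriptor leaks per iteration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/resource.h>

/* Hypothetical helper, not part of proton-c.
 * Counts the descriptors currently open in this process by probing each
 * number below the soft RLIMIT_NOFILE limit with fcntl(F_GETFD). */
static int count_open_fds(void)
{
    struct rlimit rl;
    long max = 1024;                  /* fallback if the limit is unknown */
    long fd;
    int n = 0;

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        max = (long)rl.rlim_cur;
    if (max > 65536)
        max = 65536;                  /* cap the scan on very large limits */
    assert(max > 0);

    for (fd = 0; fd < max; ++fd) {
        /* F_GETFD fails with EBADF when fd is not an open descriptor. */
        if (fcntl((int)fd, F_GETFD) != -1)
            ++n;
    }
    return n;
}
```

Recording count_open_fds() before the connect loop and comparing it after each failed connect should show a stable count once the leak is fixed, and a steadily growing one while it is present.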

This is probably the cause of https://issues.apache.org/jira/browse/DISPATCH-902, which does occur very frequently under the conditions described there.

The test program should run forever without leaking or showing any faults.

Note that gcc -fsanitize does not detect races or memory errors, which suggests the bug requires a delay at the right time to manifest. Valgrind's overhead and rr's code serialization appear to provide that delay. It seems likely that dispatch's reconnect logic provides the delay in DISPATCH-902.