There is a race condition that causes leaked FDs and segfaults in the epoll proactor under the following conditions:
- there is more than one thread processing proactor events.
- the connection targets a host address that resolves to multiple socket addresses, e.g. the NULL hostname on a machine with both IPv4 and IPv6 enabled.
- there is nothing listening on the target port.
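To check whether a given machine meets the second condition, a minimal sketch like the following counts how many socket addresses the resolver returns for a host and port. The helper name count_socket_addresses and the port "5672" used in the note below are illustrative assumptions, not taken from the reproducer, and this is not necessarily how the proactor itself performs resolution.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper (not part of proton): count the socket addresses
 * that getaddrinfo() returns for host:port. A NULL host without
 * AI_PASSIVE resolves to the loopback address(es).
 * Returns -1 if resolution fails. */
int count_socket_addresses(const char *host, const char *port)
{
    struct addrinfo hints;
    struct addrinfo *results = NULL;
    int count = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* accept both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP only, one entry per family */

    if (getaddrinfo(host, port, &hints, &results) != 0)
        return -1;

    /* Walk the linked list of results; each node is one candidate
     * address that a connect attempt could try in turn. */
    for (const struct addrinfo *ai = results; ai != NULL; ai = ai->ai_next)
        count++;

    freeaddrinfo(results);
    return count;
}
```

If count_socket_addresses(NULL, "5672") returns 2 or more, the machine satisfies the multiple-address condition; on a host with both IPv4 and IPv6 enabled the result would typically cover both loopback addresses.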
The attached reproducer shows several bad behaviors:
- under rr or valgrind (--tool=memcheck and --tool=helgrind) it quickly (< 1 min) shows race conditions and/or invalid memory access.
- it hangs fairly often even without valgrind/rr, more so if you increase the thread count. Without valgrind/rr it rarely segfaults.
- it leaks FDs: the test should run forever, but runs out of FDs after around 1024 iterations.
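The FD leak can be observed directly rather than waiting for the process to run out of descriptors (1024 is a common default soft limit for open files on Linux, which would be consistent with the iteration count above). The following is a minimal, Linux-only sketch that counts the descriptors currently open in the calling process by listing /proc/self/fd. The helper name count_open_fds is hypothetical and is not part of the attached reproducer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not part of proton or the reproducer): count the
 * file descriptors open in this process by listing /proc/self/fd.
 * Linux-specific. Returns -1 if the directory cannot be opened. */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    /* The directory stream itself holds a descriptor; exclude it so the
     * result reflects only descriptors owned by the rest of the program. */
    int self = dirfd(dir);
    int count = 0;
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue; /* skip "." and ".." */
        if (atoi(entry->d_name) == self)
            continue; /* skip the descriptor used for this listing */
        count++;
    }

    closedir(dir);
    return count;
}
```

Calling this once per iteration of the test loop and logging the result should show a steadily growing count if connection attempts are leaking descriptors, and a flat count once the leak is fixed.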
This is probably the cause of https://issues.apache.org/jira/browse/DISPATCH-902, which does occur very frequently under the conditions described there.
The test program should run forever without leaking or showing any faults.
Note that gcc -fsanitize does not detect races or memory errors, which suggests the bug requires a delay at the right time to manifest. Valgrind's overhead and rr's code serialization appear to provide that delay. It seems likely that dispatch's reconnect logic is providing the delay in