Details
- Type: Bug
- Status: Open
- Priority: Normal
- Resolution: Unresolved
- Component: Availability
- Severity: Normal
- Discovered By: User Report
- Platform: All
Description
Hello, we get the following issue with cqlsh in a Kubernetes environment.
When cqlsh is executed against a headless Cassandra service (the one configured for the Cassandra pods), it can randomly get stuck for the duration of the connect timeout.
We reproduced the issue using the following command:
for i in {1..1000}; do cqlsh --debug --connect-timeout="600" -u cassandra -p cassandra cassandra.env-1-cassandra.svc 9042 -k keyspace1_test -e "select cql_version from system.local"; sleep 1; done
If a specific Cassandra node address or IP is provided instead of the service DNS name, the issue is not reproduced.
An additional observation: the hang happens in cqlsh after the result of the select is printed, so the select query itself executes normally.
The following logic was added to cqlsh.py to capture debug logs from the Cassandra Python driver:
import logging
logging.basicConfig(filename='/tmp/cqlsh.log', level=logging.DEBUG,
                    format='%(asctime)s | %(levelname)-8s | %(threadName)s | %(module)s:%(name)s | %(lineno)04d | %(message)s')
logging.info('=================================================')
The log captured for a cqlsh invocation that reproduces the issue: cqlsh_debug.txt
According to the log, cassandra.env-1-cassandra.svc, provided as the contact point to cqlsh, is immediately resolved to 3 node IPs, before any discovery queries reach the Cassandra servers:
2021-03-18 21:37:11,566 | DEBUG | MainThread | pool:cassandra.pool | 0139 | Host 10.129.19.19 is now marked up
2021-03-18 21:37:11,566 | DEBUG | MainThread | pool:cassandra.pool | 0139 | Host 10.129.205.119 is now marked up
2021-03-18 21:37:11,566 | DEBUG | MainThread | pool:cassandra.pool | 0139 | Host 10.130.169.162 is now marked up
The corresponding logic in cassandra/cluster.py in the Cassandra Python driver:
self.contact_points_resolved = [endpoint[4][0] for a in self.contact_points for endpoint in socket.getaddrinfo(a, self.port, socket.AF_UNSPEC, socket.SOCK_STREAM)]
So the driver resolves all IPs for the DNS name, e.g.:
nslookup cassandra.env-1-cassandra
Server:  172.30.0.10
Address: 172.30.0.10#53

Name: cassandra.env-1-cassandra.svc.cluster.local
Address: 10.129.205.119
Name: cassandra.env-1-cassandra.svc.cluster.local
Address: 10.129.38.98
Name: cassandra.env-1-cassandra.svc.cluster.local
Address: 10.130.169.162
After that, the driver uses all of the resolved IPs as contact points.
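The resolution behaviour can be reproduced with a minimal sketch that mirrors the driver's contact point expansion (resolve_contact_points is a hypothetical helper name; localhost stands in for the service DNS name):

```python
import socket

def resolve_contact_points(contact_points, port):
    # One DNS name can expand into several endpoints; each resolved IP
    # becomes a separate contact point, mirroring contact_points_resolved.
    return [endpoint[4][0]
            for name in contact_points
            for endpoint in socket.getaddrinfo(name, port,
                                               socket.AF_UNSPEC,
                                               socket.SOCK_STREAM)]

ips = resolve_contact_points(["localhost"], 9042)
print(ips)  # a headless k8s service would yield one IP per pod
```

With a headless service, the A record carries one address per pod, so the list above would contain every pod IP, and the driver opens its control connection against all of them.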
Several threads in the driver work with connections; Thread-1 and Thread-2 handle events related to connection state changes and execute reconnects.
Once cqlsh has executed a query and received the result, it triggers a graceful shutdown of the driver instance.
The shutdown logic closes the connections:
2021-03-18 21:37:14,140 | DEBUG | Dummy-3 | asyncorereactor:cassandra.io.asyncorereactor | 0368 | Closing connection (139784430944720) to 10.129.205.119
2021-03-18 21:37:14,140 | DEBUG | Dummy-3 | asyncorereactor:cassandra.io.asyncorereactor | 0375 | Closed socket to 10.129.205.119
At the same moment, in parallel, the host event handling logic sometimes decides to reconnect to a host (a race condition):
2021-03-18 21:37:14,145 | DEBUG | Thread-2 | cluster:cassandra.cluster | 2792 | [control connection] Attempting to reconnect
2021-03-18 21:37:14,145 | DEBUG | Thread-2 | cluster:cassandra.cluster | 2742 | [control connection] Opening new connection to 10.129.205.119
The reconnecting thread then starts waiting, for up to the connect timeout, for the new connection to be initialized.
2021-03-18 21:37:14,194 | DEBUG | Dummy-3 | cluster:cassandra.cluster | 3245 | Shutting down Cluster Scheduler
2021-03-18 21:37:14,245 | DEBUG | Dummy-3 | cluster:cassandra.cluster | 2848 | Shutting down control connection
<--- stuck is here -->
2021-03-18 21:47:14,146 | DEBUG | Thread-2 | connection:cassandra.connection | 0639 | Not sending options message for new connection(139784430946896) to 10.129.205.119 because compression is disabled and a cql version was not specified
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | connection:cassandra.connection | 0712 | Sending StartupMessage on <AsyncoreConnection(139784430946896) 10.129.205.119:9042>
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | connection:cassandra.connection | 0720 | Sent StartupMessage on <AsyncoreConnection(139784430946896) 10.129.205.119:9042>
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | asyncorereactor:cassandra.io.asyncorereactor | 0368 | Closing connection (139784430946896) to 10.129.205.119
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | asyncorereactor:cassandra.io.asyncorereactor | 0375 | Closed socket to 10.129.205.119
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | connection:cassandra.connection | 0761 | Connection to 10.129.205.119 was closed during the startup handshake
2021-03-18 21:47:14,147 | WARNING | Thread-2 | cluster:cassandra.cluster | 2731 | [control connection] Error connecting to 10.129.205.119:
Traceback (most recent call last):
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2724, in _reconnect_internal
    return self._try_connect(host)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2746, in _try_connect
    connection = self._cluster.connection_factory(host.address, is_control_connection=True)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1138, in connection_factory
    return self.connection_class.factory(address, self.connect_timeout, *args, **kwargs)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/connection.py", line 341, in factory
    raise OperationTimedOut("Timed out creating connection (%s seconds)" % timeout)
OperationTimedOut: errors=Timed out creating connection (600 seconds), last_host=None
After that, it realises that the connection was closed due to the shutdown, and the reconnect task finally completes:
2021-03-18 21:47:14,147 | DEBUG | Thread-2 | cluster:cassandra.cluster | 2814 | [control connection] error reconnecting
Traceback (most recent call last):
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2794, in _reconnect
    self._set_new_connection(self._reconnect_internal())
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2733, in _reconnect_internal
    raise DriverException("[control connection] Reconnection in progress during shutdown")
DriverException: [control connection] Reconnection in progress during shutdown
Once the task completes, the thread pool running Thread-1 and Thread-2 is stopped and the cqlsh process terminates.
The same is visible in the thread dumps captured for cqlsh (pystack-2868.txt).
The thread that is trying to reconnect (waiting out the connect timeout):
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2794, in _reconnect
  self._set_new_connection(self._reconnect_internal())
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2724, in _reconnect_internal
  return self._try_connect(host)
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2746, in _try_connect
  connection = self._cluster.connection_factory(host.address, is_control_connection=True)
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1138, in connection_factory
  return self.connection_class.factory(address, self.connect_timeout, *args, **kwargs)
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/connection.py", line 332, in factory
  conn = cls(host, *args, **kwargs)
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py", line 355, in __init__
  init_handler.wait(kwargs["connect_timeout"])
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py", line 77, in wait
  self.event.wait(timeout)
File "/usr/local/lib/python2.7/threading.py", line 614, in wait
  self.__cond.wait(timeout)
File "/usr/local/lib/python2.7/threading.py", line 359, in wait
  _sleep(delay)
Meanwhile, the shutdown path is waiting for the thread pool (the Cluster.executor field) to finish shutting down:
File "/usr/local/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
  func(*targs, **kargs)
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 181, in _shutdown_clusters
  cluster.shutdown()
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1264, in shutdown
  self.executor.shutdown()
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/futures-2.1.6-py2.py3-none-any.zip/concurrent/futures/thread.py", line 137, in shutdown
  t.join()
File "/usr/local/lib/python2.7/threading.py", line 940, in join
  self.__block.wait()
File "/usr/local/lib/python2.7/threading.py", line 340, in wait
  waiter.acquire()
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
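The join behaviour in the dump above can be illustrated with a small sketch: executor.shutdown(wait=True) joins the worker threads, so it cannot return while a submitted task is still blocked (here a plain sleep stands in for the reconnect attempt waiting on its timeout; the pool below is only a stand-in for Cluster.executor):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for Cluster.executor with one in-flight "reconnect" task.
executor = ThreadPoolExecutor(max_workers=2)
executor.submit(time.sleep, 0.5)  # models the reconnect blocked on its timeout

start = time.monotonic()
executor.shutdown(wait=True)  # joins worker threads, like t.join() in the dump
elapsed = time.monotonic() - start
print("shutdown took %.2fs" % elapsed)
```

With a 600-second connect timeout, this is exactly the 10-minute stall seen in the log: shutdown cannot complete until the in-flight reconnect task runs out its timeout.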
The difference between specifying a single IP/hostname and the service DNS name is how many IPs are resolved, and therefore how many contact points are used to establish the control connection during startup.
So the root cause of the delay looks like a race condition in the Cassandra Python driver shipped with cqlsh: during shutdown, when the DNS name given to cqlsh as a contact point resolves to multiple IPs, a control connection reconnect can start in parallel with the shutdown and then block for the full connect timeout.
The issue was found in Cassandra 3.11.5, but the same applies to 3.11.10 (the bundled Python driver build is the same: cassandra-driver-3.11.0-bb96859b).
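The hang itself can be modelled with a self-contained sketch (hypothetical names, not actual driver code): a reconnect thread waits on a per-connection "initialized" event with the connect timeout, while the shutdown path joins that thread; since shutdown never signals the event, the join blocks for the whole timeout.

```python
import threading
import time

CONNECT_TIMEOUT = 0.5  # the real report used 600 seconds

# Would be set once the connection handshake completes; during shutdown the
# socket is closed instead, so nothing ever sets it.
init_event = threading.Event()

def reconnect_worker():
    # Mirrors init_handler.wait(connect_timeout) in the reconnect path.
    init_event.wait(CONNECT_TIMEOUT)

worker = threading.Thread(target=reconnect_worker)
worker.start()

start = time.monotonic()
worker.join()  # shutdown joins the pool; blocks until wait() times out
stalled = time.monotonic() - start
print("shutdown stalled for %.2fs" % stalled)
```

A fix would presumably need the shutdown path to signal the waiter (e.g. calling init_event.set() here before the join), so the in-flight reconnect aborts immediately instead of running out the timeout.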