Description
Today our CI images at Apache Airflow started to fail, and when we investigated, the root cause appears to be that the Cassandra 3.0 image in our CI jobs failed to start (and pass health checks). One of our tests brings up a number of images via docker compose, and we use the "cassandra:3.0" image for that. We noticed that 3.0.26 was released 15 hours ago, so this is almost certainly caused by some difference between 3.0.25 and 3.0.26.
The whole test run fails because the cassandra container is unhealthy:
https://github.com/apache/airflow/runs/6320170343?check_suite_focus=true#step:10:6651
https://github.com/apache/airflow/runs/6319805534?check_suite_focus=true#step:10:12629
https://github.com/apache/airflow/runs/6319710486?check_suite_focus=true#step:10:6759
ERROR: for airflow Container "3bd115315ba7" is unhealthy.
Encountered errors while bringing up the project.
3bd115315ba7 cassandra:3.0 "docker-entrypoint.s…" 5 minutes ago Up 5 minutes (unhealthy) 7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp airflow-integration-postgres_cassandra_1
The logs from the cassandra container do not show anything suspicious:
INFO 08:45:22 Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a, netty-codec=netty-codec-4.0.44.Final.452812a, netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, netty-codec-http=netty-codec-http-4.0.44.Final.452812a, netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, netty-common=netty-common-4.0.44.Final.452812a, netty-handler=netty-handler-4.0.44.Final.452812a, netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, netty-transport=netty-transport-4.0.44.Final.452812a, netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO 08:45:22 Starting listening for CQL clients on /0.0.0.0:9042 (unencrypted)...
INFO 08:45:23 Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO 08:45:23 Startup complete
INFO 08:45:24 Created default superuser role 'cassandra'
Our docker-compose entry is here:
https://github.com/apache/airflow/blob/main/scripts/ci/docker-compose/integration-cassandra.yml
Basically, we run a healthcheck that checks whether cassandra is up. This health check worked fine before, but it seems to fail now. Either we are using the wrong healthcheck or there is some bug in the command:
    healthcheck:
      test: "[ $$(nodetool statusgossip) = running ]"
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
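The shell test above only compares the standard output of nodetool with the word "running", so any error that nodetool prints is discarded and the container is simply reported as unhealthy. As a debugging aid, here is a minimal Python sketch (Python 3.7+) that applies the same comparison but also returns whatever nodetool printed when the comparison fails. The function name `check_gossip` and its `runner` parameter are hypothetical and not part of Airflow or Cassandra; the sketch assumes nodetool is on the PATH of wherever it runs.

```python
import subprocess
from typing import Callable, Sequence, Tuple


def check_gossip(
    command: Sequence[str] = ("nodetool", "statusgossip"),
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Tuple[bool, str]:
    """Return (healthy, detail) for the same condition as the shell test.

    healthy is True only when the command prints exactly "running".
    detail carries stderr (or stdout) so a failure is not silent.
    `runner` is a hypothetical hook that makes the function testable.
    """
    result = runner(list(command), capture_output=True, text=True, timeout=30)
    stdout = result.stdout.strip()
    healthy = stdout == "running"
    if healthy:
        return True, stdout
    # Prefer the error stream; fall back to stdout, then to a placeholder.
    detail = result.stderr.strip() or stdout or "<no output>"
    return False, detail
```

Calling `check_gossip()` inside the container, or from the host with a command such as `("docker", "exec", "airflow-integration-postgres_cassandra_1", "nodetool", "statusgossip")`, should show whether nodetool itself is failing rather than gossip being down.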
We mitigated it by temporarily switching to 3.0.25: https://github.com/apache/airflow/pull/23522
Is this an error in cassandra? Or should we maybe change our health-check command?
Attachments
Issue Links
- duplicates CASSANDRA-17581 nodetool with Java 8u331 returns "URISyntaxException: 'Malformed IPv6 address at index 7: rmi://[127.0.0.1]:7199'" (Resolved)