Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.10.0
-
None
-
None
Description
*Underlying Problem*
There is a bug in CentOS 7.3 (and RHEL 7.3) related to reloading of the SSH daemon configuration. See:
- RHEL bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1381997 (which gives the best description of the underlying problem in https://bugzilla.redhat.com/show_bug.cgi?id=1381997#c4).
- CentOS bug report at https://bugs.centos.org/view.php?id=12757
- The change introduced the problematic behavior at https://bugzilla.redhat.com/show_bug.cgi?id=1291172
*Brooklyn Code calling sshd reload*
The problems occurs when jclouds (or org.apache.brooklyn.policy.jclouds.os.CreateUserPolicy) modifies /etc/ssh/sshd_config and then calls service sshd reload. This can leave systemd's view of sshd in a bad state. A subsequent call to service sshd reload or service sshd restart will not have the desired effect!
The jclouds code calls service sshd reload if it is configured to lock-down sshd, which is the default - it will configure PasswordAuthentication no and PermitRootLogin no (see https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/location/jclouds/JcloudsLocation.java#L2293).
Subsequent use of CreateUserPolicy will also call service sshd reload (see https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/policy/jclouds/os/CreateUserPolicy.java#L159-L164).
*Observed Behaviour: initial jclouds provisioning*
The exact behaviour we see on CentOS 7.3 is non-deterministic. Below we attempt to describe what can happen.
TL;DR: systemd thinks sshd is inactive/dead, but the process is really running.
On a newly provisioned CentOS 7.3 VM, immediately after we've provisioned it through jclouds...
Looking at the output from jclouds (having tweaked it so that the stderr isn't lost), it executed:
exec 3<> /etc/ssh/sshd_config && awk -v TEXT="PasswordAuthentication no PermitRootLogin no " 'BEGIN {print TEXT}{print}' /etc/ssh/sshd_config >&3 hash service && service ssh reload || service sshd reload || /etc/init.d/ssh* reload
The relevant stderr is:
Redirecting to /bin/systemctl reload sshd.service Redirecting to /bin/systemctl status sshd.service
The status of sshd (according the systemd) is:
$ service sshd status Redirecting to /bin/systemctl status sshd.service * sshd.service - OpenSSH server daemon Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled) Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 5s ago Docs: man:sshd(8) man:sshd_config(5) Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS) Main PID: 10348 (code=exited, status=0/SUCCESS) CGroup: /system.slice/sshd.service └─28942 /usr/sbin/sshd
However, the process is really running and we can ssh to it:
$ cat /var/run/sshd.pid 28942 $ ps aux | grep "/usr/sbin/sshd" root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd
*Observed Behaviour: sshd reload again*
TL;DR: running reload again fails - the process keeps running, but does not re-load the sshd_config file.
After running service sshd reload, if the same command is run again we get:
$ sudo service sshd reload Redirecting to /bin/systemctl reload sshd.service Job for sshd.service invalid. $ echo $? 1 $ cat /var/run/sshd.pid 28942 $ ps aux | grep "/usr/sbin/sshd" root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd $ service sshd status Redirecting to /bin/systemctl status sshd.service * sshd.service - OpenSSH server daemon Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled) Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 1min 9s ago Docs: man:sshd(8) man:sshd_config(5) Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS) Main PID: 10348 (code=exited, status=0/SUCCESS) CGroup: /system.slice/sshd.service └─28942 /usr/sbin/sshd
In /var/log/messages, we have:
Feb 8 15:17:23 host-10-104-0-188 systemd: Unit sshd.service cannot be reloaded because it is inactive.
*Observed Behaviour: restart, after a failed reload*
TL;DR: systemd does not restart the process (same pid) - the existing process is now recognised as running though, but the /etc/ssh/sshd_config has not be reloaded.
$ sudo service sshd restart Redirecting to /bin/systemctl restart sshd.service $ echo $? 0 $ cat /var/run/sshd.pid 28942 $ ps aux | grep "/usr/sbin/sshd" root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd $ service sshd status Redirecting to /bin/systemctl status sshd.service * sshd.service - OpenSSH server daemon Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled) Active: active (running) since Wed 2017-02-08 15:21:39 UTC; 5s ago Docs: man:sshd(8) man:sshd_config(5) Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 29726 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS) Main PID: 28942 (sshd) CGroup: /system.slice/sshd.service └─28942 /usr/sbin/sshd Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Starting OpenSSH server daemon... Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Started OpenSSH server daemon.
In /var/log/messages, we have:
Feb 8 15:21:39 host-10-104-0-188 systemd: Starting OpenSSH server daemon... Feb 8 15:21:39 host-10-104-0-188 systemd: Started OpenSSH server daemon.
And in /var/log/secure:
Feb 8 15:21:39 host-10-104-0-188 sudo: amp : TTY=pts/0 ; PWD=/home/users/amp ; USER=root ; COMMAND=/usr/sbin/service sshd restart Feb 8 15:21:39 host-10-104-0-188 polkitd[497]: Registered Authentication Agent for unix-process:29709:569722 (system bus name :1.29 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) Feb 8 15:21:39 host-10-104-0-188 polkitd[497]: Unregistered Authentication Agent for unix-process:29709:569722 (system bus name :1.29, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) (disconnected from bus) Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use. Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on :: failed: Address already in use. Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: fatal: Cannot bind any address.
My wild guess at what has happened here is...
1. systemd thinks sshd is not running, so did not stop it.
2. systemd starts another sshd process, which fails because process 28942 is already running and listening on port 22.
3. systemd looks at the pid file, and finds 28942 written there - that sshd process is running, so it reports sshd as "active (running)".
The sshd process (28942) has not re-loaded the sshd_config file, so changes will not have taken effect.
*Observed Behaviour: stop + start, after a failed reload*
Note that calling service sshd stop; service sshd start after a bad reload will behave in the same way as the service sshd restart.
*Observed Behaviour: subsequent restart, after failed restart*
A subsequent call to restart now succeeds as expected. The process is replaced, and the changes made in sshd_config take effect.
Attachments
Issue Links
- relates to
-
JCLOUDS-1236 Provisioning of CentOS 7.3 VM can leave sshd in bad state
- Open