How systemd starts services: the bare essentials
This post starts with a warning: this isn't an in-depth treatise on systemd, and that's on purpose. First, no in-depth systemd post should be written this close to midnight, or you'll have bad dreams. And second, I've found that even this tiny piece of systemd knowledge has already given me some comfort that it's not that impossible to grok. So if you're like me, an Ubuntu user who started in the Slackware days and who is somewhat terrified of not understanding how services are started by this new incarnation of init, read on.
Preamble
For some background, the problem I ran into was that BIND was flooding my logs with IPv6 failures:
Jul 9 06:26:56 anthem named[29348]: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
That's totally appropriate, as this network doesn't have IPv6 connectivity. And I had previously silenced this (as others have too) by adding -4 to the command-line options used to start named, so what was going on?
Well, first, that option was definitely not being passed:
ps auxw | grep named
bind 19165 0.3 1.1 451316 44664 ? Ssl 17:31 1:06 /usr/sbin/named -f -u bind
And it was definitely being specified in /etc/default/bind9:
# run resolvconf?
RESOLVCONF=yes
# additional startup options for the server
OPTIONS="-u bind -4"
What's up with that? It turns out it's a bug inherited from Debian that is yet to be fixed in Xenial (UPDATE: just fixed, in time for 18.04 LTS!), and what follows is what I had to learn in order to figure that out.
(The reason a few of these ignore-etc-default bugs have cropped up is that systemd upstream doesn't like the idea of spawning a shell just to read options from /etc/default/foo. So when converting a service to systemd, a Debian package maintainer has a bit of extra work to do: parse /etc/default and set up systemd to override the options passed to the command.)
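The mismatch described above (an option set in /etc/default/bind9 that never reaches the process) can be checked mechanically. Here is a minimal sketch, assuming the defaults file is plain shell that sets OPTIONS; missing_options is a name I made up for illustration, not an existing tool.

```shell
# Sketch: report which options from an /etc/default-style file are missing
# from a process's command line. missing_options is a hypothetical helper.
missing_options() {
    defaults_file=$1   # e.g. /etc/default/bind9
    cmdline=$2         # e.g. the output of: ps -o args= -C named
    # Source the file in a subshell so the caller's environment stays untouched.
    options=$( . "$defaults_file" && printf '%s' "$OPTIONS" )
    for opt in $options; do
        case " $cmdline " in
            *" $opt "*) ;;                 # this word is on the command line
            *) printf '%s\n' "$opt" ;;     # this word never reached the process
        esac
    done
}
```

On the machine from this post, something like `missing_options /etc/default/bind9 "$(ps -o args= -C named)"` would be the way to call it. It only does a word-by-word comparison, so treat it as a quick smoke test rather than a real parser.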
"initscripts" in systemd
systemd starts services by parsing unit files; the system-provided versions live in /lib/systemd/system/, and that's where I found the bind9 unit:
[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network.target
[Service]
ExecStart=/usr/sbin/named -f -u bind
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop
[Install]
WantedBy=multi-user.target
Hah, there's nothing about parsing /etc/default/bind9 there, so of course my option is being ignored.
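Eyeballing the unit works for a file this short, but the same check can be scripted. A minimal sketch, assuming the unit is a plain file on disk; unit_start_lines and reads_defaults are hypothetical helper names, not systemd commands.

```shell
# Sketch: print the directives that decide a service's command line.
# unit_start_lines and reads_defaults are hypothetical helpers.
unit_start_lines() {
    # Keep only the ExecStart= and EnvironmentFile= lines of the unit file.
    grep -E '^(ExecStart|EnvironmentFile)=' "$1"
}

# Succeeds only if the unit explicitly names a file under /etc/default/.
# The optional "-" prefix means "ignore the file if it does not exist".
reads_defaults() {
    grep -Eq '^EnvironmentFile=-?/etc/default/' "$1"
}
```

For the bind9 unit above, unit_start_lines prints just the ExecStart line and reads_defaults fails, which is the whole bug in one exit status. On a live system, `systemctl cat bind9` shows the unit file systemd is actually using, drop-ins included. My assumption is that a unit honoring the defaults file would carry an EnvironmentFile= line and reference $OPTIONS in ExecStart, but I haven't quoted the fixed package here.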
Now systemd lets you override how a service is started by putting a unit file of the same name under /etc/systemd/system, so to test a fix I just copied the file over and changed it:
# cd /etc/systemd/system
# cp /lib/systemd/system/bind9.service .
# sed -i 's/named -f/named -4 -f/' bind9.service
But when I restarted the service with systemctl restart bind9, I noticed that the option was still not there. Hmm.
The status output shows in its second line what's wrong:
● bind9.service - BIND Domain Name Server
Loaded: loaded (/lib/systemd/system/bind9.service; enabled; vendor preset: enabled)
It's not looking at /etc. Interestingly, systemd caches its configuration, so to get it to re-read the files you need to issue systemctl daemon-reload.
Once you've done that, status will show you the right output:
● bind9.service - BIND Domain Name Server
Loaded: loaded (/etc/systemd/system/bind9.service; enabled; vendor preset: enabled)
and behold, it works:
bind 19165 0.3 1.1 451316 44664 ? Ssl 17:31 1:08 /usr/sbin/named -4 -f -u bind
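To convince myself of the "/etc wins over /lib" rule without touching a live system, the steps above can be rehearsed in a scratch directory. This is a simplified sketch: resolve_unit, override_unit and the root argument are hypothetical, and the search order is reduced to the three directories that matter here (systemd itself looks in a few more).

```shell
# Sketch: mimic which copy of a system unit file takes precedence.
# resolve_unit is a hypothetical helper; systemd does the real lookup.
resolve_unit() {
    root=$1
    unit=$2
    # Earlier directories shadow later ones.
    for dir in etc/systemd/system run/systemd/system lib/systemd/system; do
        if [ -f "$root/$dir/$unit" ]; then
            printf '%s\n' "/$dir/$unit"
            return 0
        fi
    done
    return 1
}

# Copy the vendor unit into /etc (under the scratch root) and add -4,
# doing the cp and the sed from the post in a single step.
override_unit() {
    root=$1
    unit=$2
    mkdir -p "$root/etc/systemd/system" &&
    sed 's/named -f/named -4 -f/' "$root/lib/systemd/system/$unit" \
        > "$root/etc/systemd/system/$unit"
}
```

Before override_unit runs, resolve_unit reports the /lib copy; afterwards it reports the /etc copy. On the real system the file on disk changes immediately, but systemd only notices after systemctl daemon-reload, which is exactly the step I tripped over.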
Partial overrides through drop-ins
You can also override a specific stanza by populating a file in the bind9.service.d directory; for instance, instead of copying the whole unit file, you could just:
# mkdir /etc/systemd/system/bind9.service.d
# vi /etc/systemd/system/bind9.service.d/no-ipv6.conf
providing the following contents:
[Service]
ExecStart=
ExecStart=/usr/sbin/named -f -u bind -4
That gets read and applied on top of whatever the other unit files for the service (in both /lib and /etc) specify.
Note that ExecStart is, oddly, specified twice, the first time with an empty right-hand side. That's because ExecStart is parsed as a list (I suppose to allow executing multiple processes in a single unit file), and in our case we want to override the single named command being run, so we need to clear the list first. See Example 2 of the systemd.unit manpage[1] and this great AskUbuntu post for more detail.
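The clear-then-set behaviour is easier to see with a toy model. The sketch below is my own simplification, not systemd's parser: effective_execstart is a hypothetical helper that reads the main unit and then its drop-ins in order, treats an empty ExecStart= as "reset the list", and appends anything else.

```shell
# Sketch: a simplified model of how ExecStart= accumulates across a unit
# file and its drop-ins. effective_execstart is hypothetical; the real
# parser also handles sections, line continuations and other list settings.
effective_execstart() {
    # Files are read in the order given: main unit first, then drop-ins.
    awk '
        /^ExecStart=$/ { n = 0; next }                  # empty value clears the list
        /^ExecStart=/  { list[++n] = substr($0, 11) }   # anything else is appended
        END { for (i = 1; i <= n; i++) print list[i] }
    ' "$@"
}
```

Fed the bind9 unit followed by no-ipv6.conf, this prints only the command with -4. Drop the empty ExecStart= line from the drop-in and it prints two commands, which as far as I know systemd refuses for anything that isn't a Type=oneshot service.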
is-system-running, and degraded mode
systemd has a somewhat confusing, but useful command to check overall status:
# systemctl is-system-running
running
The meaning of the output surprised me: running means the system is up and all services came up correctly. If one or more services failed to start, you'll see this instead:
# systemctl is-system-running
degraded
Surprisingly, I'm not the first person to stumble upon this command, run it, and discover something that I should have fixed. This post on reddit showed me what to do when that happened on one of my servers:
# systemctl --failed
smartd
It turns out smartd was completely failing to run:
# systemctl status smartd
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/lib/systemd/system/smartd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2017-07-13 22:27:02 BRT; 47s ago
Docs: man:smartd(8)
man:smartd.conf(5)
Process: 12542 ExecStart=/usr/sbin/smartd -n $smartd_opts (code=exited, status=17)
Main PID: 12542 (code=exited, status=17)
Jul 13 22:27:02 chorus systemd[1]: Started Self Monitoring and Reporting Technology (SMART) Daemon.
Jul 13 22:27:02 chorus smartd[12542]: smartd 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-83-generic] (local build)
Jul 13 22:27:02 chorus smartd[12542]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 13 22:27:02 chorus smartd[12542]: Opened configuration file /etc/smartd.conf
Jul 13 22:27:02 chorus smartd[12542]: Configuration file /etc/smartd.conf parsed.
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Unit entered failed state.
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Failed with result 'exit-code'.
Okay, it failed, but it doesn't tell me much about why. To dig in, you can either look at the logs, or use journalctl:
journalctl -u smartd.service
In the output, nicely colored in red, I got
Jul 13 19:02:39 chorus smartd[804]: Unable to register ATA device /dev/sda at line 23 of file /etc/smartd.conf
Jul 13 19:02:39 chorus smartd[804]: Unable to register device /dev/sda (no Directive -d removable). Exiting.
All of this because the config file had a bad path and I'd never bothered to fix it when I replaced the drives on this machine. Oops!
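With more than one failed unit, the failed-list-then-journalctl dance gets repetitive, so here is a minimal sketch that turns one into the other. journal_commands is a hypothetical helper; it assumes its input has the unit name in the first column, as `systemctl --failed --no-legend --plain` prints it, and it only prints the commands rather than running them.

```shell
# Sketch: turn a list of failed units into journalctl commands to run.
# journal_commands is a hypothetical helper; it prints, it does not execute.
journal_commands() {
    # stdin: one unit per line, unit name in the first column.
    while read -r unit _rest; do
        [ -n "$unit" ] || continue
        # -p err keeps error-level messages and worse; -b limits to this boot.
        printf 'journalctl -u %s -p err -b --no-pager\n' "$unit"
    done
}
```

Usage would be along the lines of `systemctl --failed --no-legend --plain | journal_commands`, piping the result to sh once you trust it. The -p err filter should keep just the lines that show up in red, like the ones above.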
Finally, there are a bunch of possible states in the systemctl manpage's Table 2 for is-system-running, but you're only likely to see running, degraded or maintenance in real life; the rest are transient states that happen during startup or shutdown (or weirder things).
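If you want to wire this into a health check, a small lookup is enough. A minimal sketch: classify_state and its verdict strings are names I invented, and the grouping simply mirrors the paragraph above.

```shell
# Sketch: map is-system-running states to a coarse verdict.
# classify_state and the verdict names are hypothetical.
classify_state() {
    case $1 in
        running)                        echo ok ;;
        degraded)                       echo check-failed-units ;;
        maintenance)                    echo rescue-or-emergency ;;
        initializing|starting|stopping) echo transient ;;
        *)                              echo unknown ;;   # offline, unknown, anything else
    esac
}
```

You'd call it as `classify_state "$(systemctl is-system-running)"`. For a plain yes/no check you don't even need that: the command exits non-zero for anything other than running.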
Thanks so much to Steve Langasek for giving me some initial hints, and until the next time.
[1] I'm not sure why, but the CoreOS docs for drop-in units currently copy the manpage's text about list settings verbatim, but omit the helpful example. Even worse, their SEO beats the manpage!