How systemd starts services: the bare essentials

This post starts with a warning: this isn't an in-depth treatise on systemd, and that's on purpose. First, no in-depth systemd post should be written this close to midnight, or you'll have bad dreams. And second, I've found that even this tiny piece of systemd knowledge has already given me some comfort that it's not that impossible to grok. So if you're like me, an Ubuntu user who started in the Slackware days and who is somewhat terrified of not understanding how services are started by this new incarnation of init, read on.

Preamble

For some background, the problem I ran into was that BIND was flooding my logs with IPv6 failures:

Jul  9 06:26:56 anthem named[29348]: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53

That's totally appropriate, as this network doesn't have IPv6 connectivity. And I had previously silenced this (as others have too) by adding -4 to the commandline options used to start named, so what was going on?

Well, first, that option was definitely not being passed:

ps auxw | grep named
bind     19165  0.3  1.1 451316 44664 ?        Ssl  17:31   1:06 /usr/sbin/named -f -u bind

And it was definitely being specified in /etc/default/bind9:

# run resolvconf?
RESOLVCONF=yes
# additional startup options for the server
OPTIONS="-u bind -4"

What's up with that? It turns out to be a bug inherited from Debian that had yet to be fixed in Xenial (UPDATE: just fixed, in time for 18.04 LTS!), and what follows is what I had to learn in order to figure that out.

(The reason a few of these ignore-etc-default bugs have cropped up is that systemd upstream doesn't like the idea of spawning a shell just to read options from /etc/default/foo. So when a Debian package maintainer converts a service to systemd, it takes a bit of extra work to read /etc/default and set up the unit so that those options still reach the command.)

"initscripts" in systemd

systemd starts services by parsing unit files; the system-provided versions live in /lib/systemd/system/, and that's where I found the bind9 unit:

[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network.target

[Service]
ExecStart=/usr/sbin/named -f -u bind
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop

[Install]
WantedBy=multi-user.target

Hah, there's nothing about parsing /etc/default/bind9 there, so of course my option is being ignored.

Now systemd lets you override how a service is started by putting a unit file with the same name in /etc/systemd/system, so to test a fix I just copied the file over and changed it:

# cd /etc/systemd/system
# cp /lib/systemd/system/bind9.service .
# sed -i 's/named -f/named -4 -f/' bind9.service
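If you'd rather rehearse a substitution like this before touching /etc, here is a minimal sketch that runs the same sed expression on a scratch copy. The scratch directory and the two-line stand-in unit file are assumptions for illustration. Only the ExecStart line comes from the real unit, and sed -i is used as GNU sed provides it.

```shell
# Rehearse the edit on a throwaway copy instead of the real unit file.
scratch=$(mktemp -d)

# Stand-in unit file holding just the ExecStart line from the bind9 unit.
printf '[Service]\nExecStart=/usr/sbin/named -f -u bind\n' > "$scratch/bind9.service"

# Same substitution as above, applied to the scratch copy.
sed -i 's/named -f/named -4 -f/' "$scratch/bind9.service"

# Show the resulting ExecStart line so the change can be checked by eye.
result=$(grep '^ExecStart=' "$scratch/bind9.service")
echo "$result"

rm -rf "$scratch"
```

If the echoed line carries the -4, the expression does what you meant and can be run against the copy in /etc/systemd/system.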

But when I ran systemctl restart bind9, I noticed that the option was still not there. Hmm.

The status output shows in its second line what's wrong:

● bind9.service - BIND Domain Name Server
   Loaded: loaded (/lib/systemd/system/bind9.service; enabled; vendor preset: enabled)

It's not looking at /etc. systemd keeps the unit files it has already loaded in memory, so to get it to re-read them from disk you need to issue systemctl daemon-reload and then restart the service.

Once you've done that, status will show you the right output:

● bind9.service - BIND Domain Name Server
   Loaded: loaded (/etc/systemd/system/bind9.service; enabled; vendor preset: enabled)

and behold, it works:

bind     19165  0.3  1.1 451316 44664 ?        Ssl  17:31   1:08 /usr/sbin/named -4 -f -u bind
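Since the Loaded: line is what gave the problem away, a tiny helper can pull the path out of it so you don't have to eyeball the status output. This is only a sketch: loaded_path is a name I made up, and it assumes the status layout shown above.

```shell
# Print the unit file path from the "Loaded:" line of `systemctl status`
# output read on stdin. loaded_path is a hypothetical helper name.
loaded_path() {
    sed -n 's/^ *Loaded: loaded (\([^;]*\);.*/\1/p'
}

# Usage on a live system:
#   systemctl status bind9 | loaded_path
```

If the path printed starts with /lib rather than /etc, the override hasn't been picked up yet.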

Partial overrides through drop-ins

You can also override a specific stanza by populating a file in the bind9.service.d directory; for instance, instead of copying the whole unit file, you could just:

# mkdir /etc/systemd/system/bind9.service.d
# vi /etc/systemd/system/bind9.service.d/no-ipv6.conf

providing the following contents:

[Service]
ExecStart=
ExecStart=/usr/sbin/named -f -u bind -4

Drop-in files are read after the main unit file, and their settings are applied on top of whatever the unit files for the service (in both /lib and /etc) specify.

Note that ExecStart is oddly specified twice, the first time with an empty right-hand side. That's because ExecStart is parsed as a list (I suppose to allow executing multiple commands from a single unit file), and in our case we want to replace the single named command that is run, so we need to clear the list first. See Example 2 of the systemd.unit manpage[1] and this great AskUbuntu post for more detail.
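The mkdir-and-edit steps above can also be scripted. Here is a minimal sketch, where write_no_ipv6_dropin and its optional root argument are hypothetical and exist only so the function can be tried against a scratch directory first. The file contents are exactly the drop-in shown above.

```shell
# Write the no-ipv6 drop-in for bind9 under a chosen root directory.
# Hypothetical helper: the root defaults to /etc/systemd/system.
write_no_ipv6_dropin() {
    root=${1:-/etc/systemd/system}
    dir="$root/bind9.service.d"
    mkdir -p "$dir"
    # The empty ExecStart= clears the list before the new command is added.
    cat > "$dir/no-ipv6.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/named -f -u bind -4
EOF
}

# On the real system (as root), follow up with:
#   systemctl daemon-reload && systemctl restart bind9
```

Calling it with a temporary directory first lets you inspect the generated file before anything under /etc changes.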

is-system-running, and degraded mode

systemd has a somewhat confusing, but useful command to check overall status:

# systemctl is-system-running
running

The meaning of the output surprised me: running means the system is up and all services came up correctly. If one or more services failed to start, you'll see this instead:

# systemctl is-system-running
degraded

Surprisingly, I'm not the first person to stumble upon this command, run it, and discover something that I should have fixed. This post on reddit showed me what to do when that happened on one of my servers:

# systemctl --failed
smartd

It turns out smartd was completely failing to run:

# systemctl status smartd
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/lib/systemd/system/smartd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2017-07-13 22:27:02 BRT; 47s ago
     Docs: man:smartd(8)
           man:smartd.conf(5)
  Process: 12542 ExecStart=/usr/sbin/smartd -n $smartd_opts (code=exited, status=17)
 Main PID: 12542 (code=exited, status=17)

Jul 13 22:27:02 chorus systemd[1]: Started Self Monitoring and Reporting Technology (SMART) Daemon.
Jul 13 22:27:02 chorus smartd[12542]: smartd 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-83-generic] (local build)
Jul 13 22:27:02 chorus smartd[12542]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 13 22:27:02 chorus smartd[12542]: Opened configuration file /etc/smartd.conf
Jul 13 22:27:02 chorus smartd[12542]: Configuration file /etc/smartd.conf parsed.
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Unit entered failed state.
Jul 13 22:27:02 chorus systemd[1]: smartd.service: Failed with result 'exit-code'.

Okay, it failed, but it doesn't tell me much about why. To dig in, you can either look at the log files directly, or use journalctl:

journalctl -u smartd.service

In the output, nicely colored in red, I got

Jul 13 19:02:39 chorus smartd[804]: Unable to register ATA device /dev/sda at line 23 of file /etc/smartd.conf
Jul 13 19:02:39 chorus smartd[804]: Unable to register device /dev/sda (no Directive -d removable). Exiting.

All of this, because the config file had a bad path and I'd never bothered to fix it when I replaced the drives on this machine. Oops!

Finally, there are a bunch of possible states in the systemctl manpage's Table 2 for is-system-running, but you're only likely to see running, degraded or maintenance in real life; the rest are transient states that happen during startup or shutdown (or weirder things).
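To make those states a bit easier to act on, here is a rough sketch of a helper that turns the state name into a hint. describe_state and the wording of its messages are my own, not systemd's. The state names are the ones from the manpage table.

```shell
# Map an is-system-running state to a short hint.
# describe_state is a hypothetical helper; the messages are my own wording.
describe_state() {
    case "$1" in
        running)     echo "ok: all units came up" ;;
        degraded)    echo "check: run systemctl --failed" ;;
        maintenance) echo "rescue or emergency mode is active" ;;
        initializing|starting|stopping)
                     echo "transient: try again shortly" ;;
        *)           echo "unexpected state: $1" ;;
    esac
}

# Usage on a live system:
#   describe_state "$(systemctl is-system-running)"
```

Taking the state as an argument, rather than calling systemctl inside the function, keeps it easy to try out with any of the state names by hand.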

Thanks so much to Steve Langasek for giving me some initial hints, and until next time.


  1. I'm not sure why, but the CoreOS docs for drop-in units currently copy verbatim from the manpage the stuff about list settings, but omit the helpful example. Even worse, their SEO beats the manpage!