
The macOS shell-init race


Reboot. Open a terminal. Type git status.

zsh: command not found: git

Not every time. That was the maddening part. Maybe one boot in five, and only the first terminal I opened. Wait twenty seconds, open a new tab, everything works. git, nix, all my tools, back from the dead. So for a while I did the dumb thing: I closed the tab and opened another one. Problem solved, by which I mean ignored.

The tool that was missing changed depending on how fast I typed. Sometimes git. Sometimes nix. Once it was ls that came back from /bin instead of my coreutils, which is its own special kind of confusing because ls "worked," it just behaved differently. The common thread: anything Nix put on my PATH could be gone, and the longer I waited before checking, the more likely it was fine.

That last detail is the whole story, I just didn't read it that way at first.

Where the PATH actually comes from

On nix-darwin, my shell doesn't hardcode its PATH. There's a generated set-environment script that gets sourced at the top of /etc/zshenv, and it sets the canonical thing:

$HOME/.nix-profile/bin:/etc/profiles/per-user/$USER/bin:/run/current-system/sw/bin:...

/run/current-system/sw/bin is where the system profile lives. That symlink, /run/current-system, is what nix-darwin flips when it activates a generation. No activation, no symlink, no sw/bin, no git.
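One way to see the hole directly is to list the PATH entries that don't exist on disk at the moment a shell starts. This is a minimal sketch, not something nix-darwin ships: missing_path_entries is a made-up helper name, and it uses only parameter expansion so it should behave the same in sh, bash, and zsh.

```shell
# missing_path_entries [COLON_SEPARATED_LIST]
# Hypothetical helper: prints every entry of the list (default: $PATH)
# that does not currently exist on disk, one per line.
missing_path_entries() {
  # Trailing colon so the last entry is terminated like the others.
  _mpe_rest="${1:-$PATH}:"
  while [ -n "$_mpe_rest" ]; do
    _mpe_dir="${_mpe_rest%%:*}"   # text before the first colon
    _mpe_rest="${_mpe_rest#*:}"   # everything after the first colon
    if [ -n "$_mpe_dir" ] && [ ! -e "$_mpe_dir" ]; then
      printf '%s\n' "$_mpe_dir"
    fi
  done
  unset _mpe_rest _mpe_dir
}

# Check the PATH of the current shell.
missing_path_entries "$PATH"
```

In a shell that started before activation, I'd expect /run/current-system/sw/bin to show up in that output, and to be gone from it in a tab opened a little later.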

My first instinct was that the PATH itself was wrong. It wasn't. I spent an evening adding export PATH=... lines to my zsh config, watched them help, then watched them quietly break something else: brew's gpg started winning over the nix one, because my hardcoded PATH dropped /etc/profiles/per-user/$USER/bin. So I ripped all of that back out and left a note for future me, which is still there:

# NOTE: Do NOT export a hardcoded PATH here. nix-darwin's
# /nix/store/.../set-environment script (loaded automatically at the top
# of /etc/zshenv) already sets the canonical PATH:

The PATH was correct. The problem was when it existed.

The race

nix-darwin activation runs as a launchd daemon, org.nixos.activate-system. launchd starts it at boot. launchd also starts your login session, your window server, your terminal. These things do not wait for each other in any order I get to pick. The daemon that creates /run/current-system and my shell that reads /run/current-system/sw/bin are racing, and on a slow boot the shell wins. It starts, sources set-environment, finds no system profile, and hands me a PATH with a hole in it.

Open a terminal twenty seconds later and activation has long since finished. That's why waiting "fixed" it. I wasn't fixing anything, I was just losing the race less often.

Once you see it as a race the fix is obvious in shape: the shell has to wait for the marker before it trusts the PATH. So my shellInit polls for the symlink:

programs.zsh.shellInit = ''
  # Wait for nix-darwin activation (with 15s timeout to prevent hung shells).
  if [ ! -e /run/current-system/sw ]; then
    if /bin/wait4path /nix/store 2>/dev/null && [ ! -e /run/current-system/sw ]; then
      _nix_wait=0
      while [ ! -e /run/current-system/sw ] && [ $_nix_wait -lt 15 ]; do
        sleep 1
        _nix_wait=$((_nix_wait + 1))
      done
      unset _nix_wait
    fi
    if [ ! -e /run/current-system/sw ]; then
      echo "[nix-darwin] WARNING: /run/current-system/sw not found after 15s. Run 'make fix-nix' to recover." >&2
    fi
  fi
'';

A few things in here are not obvious, and I got most of them wrong on the first pass.

/bin/wait4path is an Apple binary that blocks until a path shows up. I use it on /nix/store, not on /run/current-system/sw, on purpose. The store mount is the thing launchd is genuinely slow to bring up on boot. There's no point spinning my own sleep loop while the disk isn't even mounted yet, so I let Apple's tool block on the mount, then poll for the activation marker once the store is actually there. If wait4path succeeds and the symlink still isn't present, then we're in the real race and the second-by-second loop takes over.

Why the timeout matters more than the wait

The escape hatch is the part I'd skip if I were being clever, and skipping it is exactly the bug.

Picture the wait without && [ $_nix_wait -lt 15 ]. The shell would block until /run/current-system/sw appears. Most of the time, fine, it appears in a second or two. But if activation genuinely failed, if the daemon never loaded, if I broke my own config and darwin-rebuild left no working generation, then the symlink never appears and the shell waits forever. Every new terminal hangs. You can't open a shell to fix the thing that's stopping your shells from opening. I have done this to myself. It is not a good afternoon.

So the loop is bounded to fifteen seconds, and when it gives up it prints a warning that tells me what to run:

[nix-darwin] WARNING: /run/current-system/sw not found after 15s. Run 'make fix-nix' to recover.

A degraded shell with a hole in its PATH is annoying. A hung shell that won't open is a recovery problem. Fifteen seconds covers every honest boot race I've ever measured, and the warning turns the unrecoverable case into a readable one. The wait fixes the common case. The timeout makes sure the rare case stays survivable.

The ordering of wait4path and the poll loop, by the way, is the only meaningful change in the commit that fixed this. The earlier version ran wait4path and then always entered the poll loop, and unset _nix_wait lived at the very end where it didn't always get reached. Folding the loop inside the wait4path success branch means I don't spin a second-by-second poll when the real problem is just a slow store mount. Small diff. It reads like a refactor. It's the difference between waiting on the right thing and waiting on the wrong one.
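The bounded wait is easier to reason about, and to test without rebooting, when it is pulled out into a function. What follows is a sketch under that assumption, not the code from my config: wait_for_marker and its arguments are hypothetical names, and the path and timeout are parameters only so the give-up branch can be exercised against a temp directory.

```shell
# wait_for_marker PATH [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until PATH exists or the
# timeout (default 15) runs out. Returns 0 if PATH exists, 1 if it gave up.
wait_for_marker() {
  _wfm_path=$1
  _wfm_limit=${2:-15}
  _wfm_waited=0
  # Both conditions are needed: the first ends the wait on success,
  # the second guarantees the loop ends even if the path never appears.
  while [ ! -e "$_wfm_path" ] && [ "$_wfm_waited" -lt "$_wfm_limit" ]; do
    sleep 1
    _wfm_waited=$((_wfm_waited + 1))
  done
  if [ -e "$_wfm_path" ]; then
    unset _wfm_path _wfm_limit _wfm_waited
    return 0
  fi
  unset _wfm_path _wfm_limit _wfm_waited
  return 1
}

# Intended use (commented out so pasting this block never blocks a shell):
# wait_for_marker /run/current-system/sw 15 ||
#   echo "[nix-darwin] WARNING: /run/current-system/sw not found after 15s." >&2
```

Calling it with a path that will never exist and a timeout of 1 is a cheap way to confirm the shell comes back instead of hanging.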

The thing the shell can't fix

Polling makes a correct shell patient. It does nothing if activation never runs at all.

And launchd, it turns out, will occasionally just not pick up org.nixos.activate-system from /Library/LaunchDaemons on boot. I can't reproduce it on command and I can't fully explain it. The daemon is installed, the plist is right, and some mornings the service just isn't loaded. When that happens no amount of waiting helps, because the thing I'm waiting for is never coming. The shell times out, prints its warning, and I go run make fix-nix.

I didn't want to run anything by hand. So there's a second daemon whose entire job is to make sure the first one runs. A watchdog, installed outside of nix-darwin so it doesn't depend on nix-darwin activation to exist, which would be circular:

<key>ProgramArguments</key>
<array>
	<string>/bin/sh</string>
	<string>-c</string>
	<string>/bin/wait4path /nix/store &amp;&amp; (/bin/launchctl kickstart system/org.nixos.activate-system 2>/dev/null || /bin/launchctl bootstrap system /Library/LaunchDaemons/org.nixos.activate-system.plist 2>/dev/null || true)</string>
</array>
<key>RunAtLoad</key>
<true/>

wait4path again, same reason: don't touch launchctl until the store is mounted. Then try to kickstart the service (run it now if launchd already has it loaded), and if that fails, bootstrap it (load it from its plist). One of those is the right move depending on whether launchd loaded the daemon without running it or skipped it entirely, and I don't know which state I'll be in, so I try kickstart first, fall back to bootstrap, and || true my way past the case where both fail.
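The one-liner in the plist is dense, so here is the same fallback chain spelled out as a function. It is a sketch for reading and dry-running, not what the watchdog actually executes: ensure_activation_loaded is a made-up name, and the LAUNCHCTL variable is a hypothetical override that lets the chain run against a stub instead of the real /bin/launchctl.

```shell
# ensure_activation_loaded
# Hypothetical helper mirroring the watchdog's chain:
#   kickstart || bootstrap || true
# Prints which branch ran. Always returns 0, like the trailing "|| true".
ensure_activation_loaded() {
  _eal_launchctl=${LAUNCHCTL:-/bin/launchctl}
  _eal_label=org.nixos.activate-system
  _eal_plist=/Library/LaunchDaemons/$_eal_label.plist

  if "$_eal_launchctl" kickstart "system/$_eal_label" 2>/dev/null; then
    # launchd already knew about the service; it has been asked to run.
    echo kickstarted
  elif "$_eal_launchctl" bootstrap system "$_eal_plist" 2>/dev/null; then
    # launchd had skipped the service; it is now loaded from its plist.
    echo bootstrapped
  else
    # Neither worked. Stay quiet about it, as the plist version does.
    echo failed
  fi
  unset _eal_launchctl _eal_label _eal_plist
  return 0
}
```

Pointing LAUNCHCTL at a script that fails for kickstart and succeeds for bootstrap shows the fallback order without touching the real daemon.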

The code that installs this watchdog lives in fix-nix.sh, and it's careful to only reinstall when the plist actually changed:

if ! cmp -s "$WATCHDOG_SRC" "$WATCHDOG_PLIST" 2>/dev/null; then
    echo "🔧 Installing activate-system watchdog..."
    sudo cp "$WATCHDOG_SRC" "$WATCHDOG_PLIST"
    sudo chown root:wheel "$WATCHDOG_PLIST"
    sudo chmod 644 "$WATCHDOG_PLIST"
    sudo launchctl bootout system/org.nixos.activate-system-watchdog 2>/dev/null || true
    sudo launchctl bootstrap system "$WATCHDOG_PLIST" 2>/dev/null || true
fi

So there are two layers, and they're answering two different failures. The shell wait handles activation running late: be patient, then give up loudly. The watchdog handles activation not running at all: kick it on every boot, after the store is up. Neither one alone closes the gap. The wait can't summon a daemon that launchd forgot, and the watchdog can't make a shell that started too early go back and re-read its PATH.

I still don't know the exact conditions where launchd drops the daemon. I've stopped needing to. The shell waits when it should and bails when it must, and the watchdog quietly re-arms the thing every boot. I haven't opened a fresh terminal to command not found: git since.

I do still sometimes close the tab and open a new one. Old habits.