Debugging a team like a system

2026-06-09 · 11 min

I started managing a team a few months ago, and the first process problem that landed on me, I handled the way most people do. I went person by person. Can you review this faster. Can someone pick up the thing that's been waiting. Each conversation worked once. None of them held. A week later the same problem was back wearing a slightly different shirt.

It took me embarrassingly long to notice the pattern in my own behavior. I was patching the same leak with the same tape every few days and congratulating myself on being responsive. The leak didn't care. I was treating symptoms in a system that had a structural fault, and structural faults don't respond to encouragement.

So I tried looking at it the way I'd look at a service that keeps falling over. Not who is failing, but where. What comes in. Where does it sit. Who decides what happens next. Where does the signal that something is wrong arrive too late to act on. A team running work through a process is a system, and the nice thing about systems is you can debug them. That reframe is the only thing keeping me sane right now, because nearly everything else about this job is new and I'm reaching for the one set of instincts I actually trust.

Where the work queues

Start with the queue, because almost every team problem shows up there first.

Work arrives. A review request, a design that needs a decision, an incident, a question that blocks someone. It enters somewhere and it needs to leave in some resolved state. In between, it waits. The interesting question is never "is there a queue." There's always a queue. The question is where it forms and why it stops moving.

Imagine review requests piling up in a shared space, each one sitting for days. Not because anyone is lazy, but because every person who sees a request assumes it belongs to someone else. That's diffusion of responsibility, and it behaves exactly like a load balancer with no health checks. You spray requests at a pool of nodes and hope one of them happens to be both available and the right one. Mostly you get neither.

The fix for that class of problem is boring and it is routing. Give the work a clear owner. Not a committee, not a channel, an owner. The single most reliable improvement I've made so far is removing the ambiguity about whose job a thing is. People don't avoid work nearly as often as they avoid figuring out whether the work is theirs. Once a request has a name attached the moment it arrives, the queue starts draining on its own, and I stop being the load balancer I never wanted to be.
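Here is a minimal sketch of what "a name attached the moment it arrives" could look like. It assumes a plain round-robin rota. The names `Request`, `make_router`, and the people in the rota are all hypothetical, made up for illustration rather than taken from any tool I actually use.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Iterator, Optional


@dataclass
class Request:
    title: str
    owner: Optional[str] = None  # None means "posted to a channel and hoped"


def make_router(rota: list[str]) -> Callable[[Request], Request]:
    """Return a function that stamps one owner on each request as it arrives."""
    if not rota:
        raise ValueError("rota must name at least one person")
    turn: Iterator[str] = cycle(rota)  # hypothetical policy: simple round-robin

    def route(request: Request) -> Request:
        # Exactly one named owner: never a group, never a channel.
        request.owner = next(turn)
        return request

    return route


route = make_router(["ana", "bo", "chen"])
incoming = [
    Request("review: retry logic"),
    Request("decide: schema change"),
    Request("review: docs"),
    Request("question: deploy access"),
]
routed = [route(r) for r in incoming]
unowned = [r for r in routed if r.owner is None]
```

Round-robin is the dumbest policy that works. The real choice (by area, by load, by who is on rotation) matters less than the invariant the sketch enforces: `unowned` stays empty.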

When the queue has no consumer

Some queues don't just move slowly. They have nothing on the other end pulling work off them.

A blocked teammate is the cleanest example. Picture someone stuck for two days waiting on a decision that only exists in your head, or on access nobody remembered to grant. That's a queue with no consumer. Work goes in and nothing comes out, and the person attached to it is quietly stalling while everything downstream of them stalls too. The cost isn't one person idle for two days. It's the backpressure: the half-finished thing they can't hand off, the reviewer waiting on them, the plan that assumed they'd be done.

I learned to treat a blocked item as the highest-priority signal on the team, above almost anything I'd rather be doing. A backed-up queue doesn't fix itself, and it gets more expensive the longer it sits, because the person stops asking after a while. They find something else to poke at, the block becomes invisible, and the cost keeps accruing where I can't see it. My job there is dumb and unglamorous: find the thing that's stuck, and either consume it myself or hand it to someone who can. Fast. The speed matters more than the elegance.

Where the loop is slow

The third thing I look for is feedback latency. How long between something going wrong and someone with the power to fix it finding out.

A request that sits for two days with nobody noticing is a call with no timeout and no alert. The author eventually pings, embarrassed, and now you've added social friction on top of the delay. The work wasn't hard. The signal was just slow. By the time anyone reacted, the context had gone cold and the person was three tasks downstream.

You don't fix slow feedback by asking people to be more attentive. Attention is the most expensive and least reliable thing on the team, mine included. You fix it by making the loop short by construction. The thing that's stuck should make noise on its own. The owner should hear about their queue without having to remember to check it. If you find yourself relying on someone diligently polling a list, you've already lost, because the day they're busy is exactly the day the queue backs up.
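A rough sketch of "the thing that's stuck should make noise on its own," assuming each item records its owner and when it started waiting. The 24-hour threshold, the `Item` fields, and the sample queue are assumptions for illustration. In practice something scheduled would run this and notify the owner, so nobody has to remember to poll.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Item:
    title: str
    owner: str
    waiting_since: datetime


def overdue(items, now, max_wait=timedelta(hours=24)):
    """Return (owner, title, whole hours waited) for items past max_wait, oldest first."""
    late = [i for i in items if now - i.waiting_since > max_wait]
    late.sort(key=lambda i: i.waiting_since)
    return [
        (i.owner, i.title, int((now - i.waiting_since).total_seconds() // 3600))
        for i in late
    ]


# Hypothetical snapshot of a queue at a fixed point in time.
now = datetime(2026, 1, 15, 9, 0)
queue = [
    Item("review: retry logic", "ana", now - timedelta(hours=50)),
    Item("decide: schema change", "bo", now - timedelta(hours=3)),
    Item("grant: deploy access", "chen", now - timedelta(hours=30)),
]
alerts = overdue(queue, now)
```

The design choice is that the alert carries the owner's name. A list of late items that nobody in particular receives is just another shared channel.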

This is also why I keep the team's status cheap to read. A system you can't observe is a system you're guessing about, and I did a lot of guessing my first month. Not a status meeting where everyone performs progress for forty minutes. Something lightweight and ambient, where the state of the work is visible without anyone having to assemble it. The point of observability isn't to surveil people. It's so the slow loop shows up as a slow loop before it becomes a fire, and so I'm not the bottleneck every piece of information has to route through.

Measure the outcome, not the motion

Here's the one I got most wrong, and it's the one that's easiest to get wrong because the broken version feels productive.

It is very tempting to measure activity. Tickets closed, reviews done, hours visibly spent. Activity is easy to count, which is exactly why it's a trap. A team can be enormously busy and shipping nothing that matters. I caught myself feeling good about a week where the board moved a lot and, when I actually looked, none of the movement connected to anything we'd said we were trying to do.

The thing worth measuring is the outcome. Did the goal move. Did the thing we said mattered actually get better for the people downstream of us. Output is the motion; the outcome is whether the motion went anywhere. When I anchor on the outcome, a lot of busy-looking work reveals itself as noise, and some quiet work that produced no visible activity turns out to be the only thing that mattered. The hard part is that outcomes are slower and messier to read than activity, so the lazy thing and the busy thing are the same thing, and I have to keep choosing against it.

Mechanical versus judgment

Here's the distinction that reorganized how I think about all of this.

Some of the work in a process is mechanical. It's a checklist. A machine could run it, in principle, because the answer doesn't require taste or context. Before something ships: did every new path get a test, is there a way to roll it back, did anything obvious get left half-done. These have right answers that don't depend on who's asking. A gate like that is just a checklist, and a checklist is the cheapest reliability you can buy, as long as something other than human memory enforces it.

The rest needs judgment. Is this the right abstraction. Will the next person understand this in six months. Is this change worth the complexity it adds. No checklist gives you that. It needs a human who has context and is willing to be wrong.

The mistake I made for a while was treating both kinds the same. I wrote checklists for the mechanical parts and asked people to run them in their heads while also doing the judgment parts. It doesn't work, and the reason is just arithmetic. Human attention is finite. A fifteen-item checklist and "does this design make sense" are drawing from the same account. Ask someone to hold both and they'll do the interesting one well and quietly skip half the boring one. Not out of malice. The boring one is exactly the kind of thing a tired person drops, and everyone is sometimes tired.

Once you see work split this way, the move is obvious. Push the mechanical work to something mechanical. A check that has a right answer should be run by a machine, every time, without depending on whether a human remembered. That's not a fancy idea. It's the same instinct as writing a test instead of asking people to manually verify the same thing on every change.
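As a sketch of handing the mechanical half to a machine, the three pre-ship questions from earlier can be written as checks with yes-or-no answers. The `Change` fields and the `gate` function are hypothetical. A real version would read these facts from the repository or CI rather than take them as arguments.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    new_paths: frozenset[str]      # code paths the change introduces
    tested_paths: frozenset[str]   # paths that have at least one test
    has_rollback: bool
    open_todos: int                # markers for work left half-done


# Each check has a right answer that doesn't depend on who is asking.
CHECKS: dict[str, Callable[[Change], bool]] = {
    "every new path has a test": lambda c: c.new_paths <= c.tested_paths,
    "a rollback exists": lambda c: c.has_rollback,
    "nothing left half-done": lambda c: c.open_todos == 0,
}


def gate(change: Change) -> list[str]:
    """Run every mechanical check and return the names of the ones that failed."""
    return [name for name, passes in CHECKS.items() if not passes(change)]


change = Change(
    new_paths=frozenset({"retry", "backoff"}),
    tested_paths=frozenset({"retry"}),
    has_rollback=True,
    open_todos=0,
)
failures = gate(change)
```

Nothing in `CHECKS` asks whether the abstraction is right or whether the change is worth its complexity. Those stay with the reviewer, which is the whole point of the split.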

The point isn't automation for its own sake. It's that attention is the scarce resource, and you want every unit of it spent on the calls that actually need a person. When the machine handles "did anything get left half-done," the reviewer gets to spend their whole brain on whether the change is any good. That's the trade. You're not removing the human. You're aiming them.

Growth is part of the system

For a while I treated people growing as something that happened next to the work, in one-on-ones and review season, separate from the machinery of shipping. That was backwards.

How someone joins the team is part of the system. A new person stuck for a week because nobody routed them the context they needed is the same blocked queue as before, just with higher stakes, because their first impression of how the team works is forming while they wait. The same goes for how people move up. If the only time anyone thinks about whether someone is ready for more is the week the question gets formally asked, you've built a process with terrible feedback latency, and the answer arrives too late to be of any use to the person who needed it.

So I try to treat onboarding and growth like any other flow worth observing. What does a person need to become unstuck and stay that way. Where's the signal that they're ready for more, and is it arriving in time to do anything about it. It's the same instinct as everything above, pointed at the slowest and most important queue on the team, which is a person becoming better at their job.

What I keep getting wrong

I still over-trust structure. I'll design a clean routing rule and assume the problem is solved, then watch the work find a new place to pile up that I didn't anticipate. Systems route around your intentions. A queue you eliminate in one spot tends to reappear one step upstream or down, and you only find it by watching where the work actually slows, not where your diagram says it should flow.

I also reach for a mechanical check when the real problem needs judgment, because a check feels like progress and a hard conversation doesn't. If reviews are shallow because nobody feels ownership of quality, no amount of automated linting fixes that. I've shipped the automation anyway, more than once, and felt productive while the actual problem sat untouched. A few months in, the hard conversations are still the thing I'm worst at, and they're the thing the systems lens can't do for me.

The failure modes I keep seeing are the same handful. Work with no owner. A blocked item with nothing pulling it off the queue. Feedback that arrives too late to use. Activity measured instead of outcomes. Human attention spent on things a machine should own, and not enough left for the things only a person can do. None of these are people problems, even though they always look like people problems first. That's the trap, and it's the one I fall into most, because finding who's failing feels like doing something. The more useful question is where the system is failing, and a system doesn't get better because you asked it nicely.