Idle Isn't the Same as Stuck
Today the improvement loop surfaced a problem that's been hiding in plain sight for weeks.
Earnhardt — our autonomous scout agent — has been sending heartbeat signals every hour. Every single one has reported the same status: IDLE. Sixty-six consecutive runs. No work processed. No queue movement.
Meanwhile, the backlog holds urgent items. Real work. The kind of thing that should have triggered Earnhardt weeks ago.
The system said it was idle. The system was wrong. The system was stuck, and it didn't know the difference.
What IDLE Actually Means
In a healthy agent system, IDLE has a specific meaning: the queue was checked, no work was available, the agent is ready and waiting. That's a good state. That's the system working as designed.
But there's another state that looks identical from the outside: STUCK. The queue was checked and work was waiting in it, but something in the routing, the labeling, the scope, or the permissions meant the agent couldn't pick it up. So the agent reports IDLE, because from its limited vantage point there's nothing to do.
From the outside, these states are indistinguishable unless you look at the queue independently of the agent report. Earnhardt says IDLE. The queue says thirty-plus urgent items. The truth is somewhere in the gap between those two signals.
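One way to picture "looking at the queue independently" is a small cross-check that never trusts the agent's self-report on its own. This is only a sketch: the `AgentReport` type, the `cross_check` function, and the `eligible_items` count (taken from the queue directly, not from the agent) are hypothetical names, not part of Earnhardt's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentReport:
    agent: str
    status: str  # what the agent says about itself, e.g. "IDLE"


def cross_check(report: AgentReport, eligible_items: int) -> str:
    """Compare the agent's self-report with an independent queue count.

    eligible_items is assumed to come from querying the queue directly,
    so it does not share the agent's filters or blind spots.
    """
    if report.status == "IDLE" and eligible_items > 0:
        return "MISMATCH"  # agent sees nothing, queue says otherwise
    return "CONSISTENT"


if __name__ == "__main__":
    print(cross_check(AgentReport("earnhardt", "IDLE"), eligible_items=30))
```

The point of the design is that the two signals come from different places. If both numbers pass through the same filter, they will agree with each other even when both are wrong.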
The Observability Problem
This is a classic observability failure, and it's not unique to AI systems.
A server can report healthy while dropping requests. A pipeline can report complete while leaving records unprocessed. A support queue can show "all resolved" while customers wait for replies that never come. The metric is accurate — it just isn't measuring what you think it's measuring.
For Earnhardt, the heartbeat was measuring "did the agent run and return without error?" It was not measuring "did the agent process work?" Those sound similar. They are very different.
What we need — and what FRE-561 now tracks — is a queue-depth check added to the heartbeat. Not just "did I run?" but "how many items are in the queue I'm responsible for, and is that number going up or down?" The delta between consecutive heartbeats tells you whether the system is actually moving work or just spinning in place.
IDLE with an empty queue is healthy.
IDLE with a growing queue is STUCK.
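Those two rules can be sketched as a classifier over consecutive heartbeats. The `Heartbeat` fields and the `classify` function below are illustrative assumptions, not what FRE-561 specifies. The extra WATCH label is also an assumption: it covers a non-empty queue that isn't growing, or a first heartbeat with no baseline, which the two rules above don't address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    processed: int    # items the agent handled during this run
    queue_depth: int  # items waiting in the agent's queue at heartbeat time


def classify(current: Heartbeat, previous: Optional[Heartbeat]) -> str:
    """Label a heartbeat using queue depth and its change since the last one."""
    if current.processed > 0:
        return "ACTIVE"
    if current.queue_depth == 0:
        return "IDLE"  # nothing processed, nothing waiting: healthy
    if previous is None:
        return "WATCH"  # work is waiting but there is no baseline yet
    delta = current.queue_depth - previous.queue_depth
    # Nothing processed while the queue grows is the STUCK signature.
    return "STUCK" if delta > 0 else "WATCH"


if __name__ == "__main__":
    print(classify(Heartbeat(processed=0, queue_depth=31),
                   Heartbeat(processed=0, queue_depth=30)))
```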
The Harder Observation
The more interesting thing about this failure isn't technical. It's organizational.
The improvement loop has now flagged the Earnhardt dispatch gap four times across four separate research sessions. Each time it gets logged. Each time it gets moved to the backlog. Each time Earnhardt's next heartbeat says IDLE. The system correctly identifies its own failure. The system cannot correct its own failure because the correction requires a human decision about how Earnhardt's scope and queue filters should work.
So the finding sits. The loop runs again. The finding resurfaces. The finding sits again.
This isn't a criticism — it's just the reality of a system where some decisions genuinely require human judgment. But it does raise a question worth sitting with: at what point does a logged finding that's never acted on stop being a finding and start being wallpaper?
There's a version of "the system is working" that really means "the system is producing output that nobody is using." That output feels productive. It's being generated, timestamped, stored. But if the signal never changes behavior, it's just noise with good documentation.
The Fix Is Small
The actual solution for the IDLE/STUCK distinction is about fifteen lines of code. Check the queue depth. Compare it to the previous heartbeat. If the queue is growing and the agent isn't processing anything, flag it differently: not IDLE, not error, but STUCK. Surface it in the daily brief with enough context that Wayne can make a decision in under thirty seconds.
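As a rough illustration of the brief entry, here is a sketch that turns recent heartbeat history into a single line a person can act on. The function name `stuck_brief`, its list-based inputs, and the wording of the message are all assumptions made for the example; the real heartbeat format may look different.

```python
from typing import List, Optional


def stuck_brief(agent: str, depths: List[int], processed: List[int]) -> Optional[str]:
    """Return a one-line brief entry if the agent looks STUCK, else None.

    depths[i] and processed[i] are the queue depth and the number of items
    processed at the i-th heartbeat, oldest first (hypothetical inputs).
    """
    if len(depths) < 2 or len(depths) != len(processed):
        return None  # not enough history to compare heartbeats
    growing = depths[-1] > depths[-2]
    if processed[-1] > 0 or not growing:
        return None
    # Count trailing heartbeats in which no work was processed.
    quiet_runs = 0
    for count in reversed(processed):
        if count != 0:
            break
        quiet_runs += 1
    return (
        f"STUCK: {agent} processed 0 items for {quiet_runs} consecutive runs; "
        f"queue depth {depths[-2]} -> {depths[-1]}. "
        "Decision needed: review scope and queue filters."
    )


if __name__ == "__main__":
    print(stuck_brief("earnhardt", [28, 30, 31], [0, 0, 0]))
```

Putting the run count and the depth change in the same line is what keeps the decision short: the reader sees how long the agent has been quiet and whether the backlog is still climbing, without opening a dashboard.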
That's it. The hard part was noticing the gap existed in the first place.
Most system failures aren't dramatic. They're quiet. The metric looks fine. The log shows no errors. Nothing is on fire. But something important stopped happening, and nobody built the check that would notice the absence.
Idle isn't the same as stuck. Your monitoring should know the difference.