Jan 25, 2026

The Almost Correct System

Why production failures aren't caused by bad code, but by good code making different assumptions.

In modern service and cloud architectures, the most painful production failures aren’t usually caused by “bad code” in the traditional sense.

They’re caused by good code making different assumptions.

This is the reality of distributed systems. It’s uncomfortable to hear, especially if you’re a careful engineer who writes tests, handles errors, and thinks about edge cases. But once you see this pattern, you’ll start noticing it everywhere, from microservice outages to distributed deadlocks and system design interview questions.

The Baseline: Why “Working Code” != A Working System

We naturally test the things we can control: the client, the API, the database. We run integration tests between them. If every individual component returns the correct output for a given input, we say the code is “correct.”

In a simple, local program, this is the ground truth. If every function is correct, the program is correct. But in a distributed cloud architecture, this logic breaks down. You can have three “correct” services that, when combined, create a catastrophic failure.

The failure usually isn’t inside your code; it’s in the space between your services.

Each component is built with assumptions about how the rest of the system behaves. When those assumptions don’t match, the system becomes fragile, even if every line of code is technically perfect.

The assumption mismatch in practice

Let’s look at something boring on purpose: timeouts. Imagine this setup, where every value looks reasonable on its own (we’ll simulate the interaction in a sketch below):

  • Client timeout: 2 seconds
  • Load balancer timeout: 5 seconds
  • Backend service timeout: 30 seconds
  • Database timeout: No limit

The Step-by-Step Failure

  1. The client sends a request.
  2. The backend is slow today (cold cache, lock contention, etc.).
  3. After 2 seconds, the client gives up and retries.
  4. The original request is still running in the backend (it has 28 seconds left).
  5. Now the backend is doing the same work twice.
  6. The database sees double load. Latency increases further.
  7. More clients retry. The system spirals.
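
To make the spiral concrete, here is a minimal, self-contained sketch in plain Python (no particular framework; the timeout values mirror the list above, everything else is illustrative). The client gives up after 2 seconds and retries, while the backend keeps grinding away at the request nobody is waiting for anymore:

    import threading
    import time

    # The hypothetical timeout values from the scenario above.
    CLIENT_TIMEOUT_S = 2
    BACKEND_TIMEOUT_S = 30

    in_flight = 0                     # how many copies of the work the backend is running
    lock = threading.Lock()

    def backend_handle(work_seconds: float) -> None:
        """Simulated backend: works until finished or until ITS 30 s timeout,
        with no idea that the client may have given up long ago."""
        global in_flight
        with lock:
            in_flight += 1
        time.sleep(min(work_seconds, BACKEND_TIMEOUT_S))  # slow today: cold cache, lock contention, ...
        with lock:
            in_flight -= 1

    def client_call(work_seconds: float) -> bool:
        """Simulated client: waits 2 seconds, then gives up and retries once.
        Every attempt starts new backend work; abandoned work is never cancelled."""
        for _attempt in range(2):                         # the original request plus one retry
            worker = threading.Thread(target=backend_handle, args=(work_seconds,), daemon=True)
            worker.start()
            worker.join(timeout=CLIENT_TIMEOUT_S)         # the client-side timeout
            if not worker.is_alive():
                return True                               # the backend answered in time
            # Timed out: the client retries, but the old work keeps running.
        return False

    if __name__ == "__main__":
        client_call(work_seconds=10)                      # the backend is slow today
        with lock:
            print(f"client gave up; backend is still running {in_flight} copies of the work")
        # Prints 2: the original attempt plus the retry, both still consuming capacity.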

No single component broke. The database didn’t crash; the service didn’t leak memory. The failure emerged from how their assumptions interacted.

Bridging the gap with explicit contracts

Every boundary in a system has a contract, whether you wrote it down or not. We often rely on implicit contracts:

  • “This request finishes quickly”
  • “Retries are safe”
  • “This operation runs once”

The problem is that when assumptions are implicit, different parts of the system invent their own version of reality. That’s where “almost correct” systems are born.

If a client times out at 2 seconds, the backend must know its work is no longer wanted. If a client retries, the operation must be idempotent.
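
One common way to meet that second requirement is an idempotency key: the client attaches a unique key to each logical operation, and the server stores the outcome under that key so a retry replays the stored outcome instead of repeating the work. A minimal in-memory sketch, with illustrative names rather than any particular library’s API:

    import uuid

    # In-memory result store keyed by idempotency key. In a real service this lives
    # in a database with the same lifetime as the operation it protects.
    _results: dict[str, dict] = {}

    def charge_card(idempotency_key: str, amount_cents: int) -> dict:
        """Perform the charge at most once per key; a retry with the same key
        gets the original result back instead of charging again."""
        if idempotency_key in _results:
            return _results[idempotency_key]     # replay the stored outcome
        result = {
            "charge_id": str(uuid.uuid4()),
            "amount_cents": amount_cents,
            "status": "charged",
        }
        _results[idempotency_key] = result
        return result

    # The client generates the key once per logical operation and reuses it on every retry.
    key = str(uuid.uuid4())
    first = charge_card(key, 500)
    retry = charge_card(key, 500)                # e.g. the client timed out and tried again
    assert first["charge_id"] == retry["charge_id"]   # the customer was charged exactly once

In a real service the lookup-and-store has to be atomic (for example, a unique constraint on the key); otherwise two concurrent retries can both miss the check and do the work twice.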

How to reason about boundaries

To move from “Junior” to “Senior” systems thinking, you have to shift your primary question:

  • Junior-level thinking: “Is my code correct?”
  • Senior-level thinking: “What assumptions does my code make, and who depends on them?”

The longer a system lives, the more assumption drift it accumulates. To combat this, you need to implement alignment strategies:

  1. Align Timeouts: Timeouts should shrink as a request moves downstream, so a callee gives up before the caller that is waiting on it does. Better still is Deadline Propagation, where the remaining time budget is passed along the request chain (sketched after this list).
  2. Make Operations Idempotent: If a caller assumes they can retry safely, you must assume they will retry multiple times.
  3. Use Backpressure: If you assume the system can handle X load, you must have a way to say “no” when X is exceeded, rather than slowing down for everyone.
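
A rough sketch of the first and third ideas, assuming a simple in-process call chain (the function names and limits are illustrative, not any specific framework): the edge of the system picks one absolute deadline and passes the remaining budget down, and a bounded semaphore lets the service refuse work instead of queueing it forever.

    import threading
    import time

    class Overloaded(Exception):
        """Load shedding: the service says 'no' instead of queueing forever."""

    class DeadlineExceeded(Exception):
        """The caller's remaining budget is too small to bother starting."""

    MAX_IN_FLIGHT = 100                                 # the load we assume we can handle
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def handle_request(deadline: float) -> str:
        """Entry point. `deadline` is an absolute time (time.monotonic() based) chosen
        once at the edge and handed down, instead of each layer inventing its own timeout."""
        if not _slots.acquire(blocking=False):          # backpressure: fail fast when full
            raise Overloaded("too many requests in flight")
        try:
            return query_database(deadline)
        finally:
            _slots.release()

    def query_database(deadline: float) -> str:
        """Downstream call: derives its timeout from the remaining budget,
        so it can never outlive the caller that is waiting for it."""
        remaining = deadline - time.monotonic()
        if remaining <= 0.05:
            raise DeadlineExceeded("not enough budget left to start")
        # A real driver would be handed timeout=remaining; here we just simulate work.
        time.sleep(min(0.01, remaining))
        return "rows"

    # The edge of the system is the only place a wall-clock number is chosen.
    deadline = time.monotonic() + 2.0                   # the client's 2-second budget
    print(handle_request(deadline))

The design choice that matters is that only the edge picks a wall-clock number; every layer below works against whatever budget remains and refuses explicitly when it cannot meet it.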

Why “almost correct” is worse than “broken”

Failing loud is a feature. When a system crashes or returns a 500, you know exactly when and where it broke. Experienced engineers aim for this “fail fast” behavior because it surfaces problems immediately.

The danger comes from the impulse, often seen in junior developers, to “handle” every error by hiding it. This leads to the most dangerous state: the almost-correct system.
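
In code, that impulse tends to look like the first function below, and the fail-fast alternative like the second (a deliberately tiny, hypothetical example; FlakyDB just stands in for any dependency that sometimes fails):

    class FlakyDB:
        """Stand-in dependency that fails sometimes, like a real one."""
        def write(self, key: str, value: dict) -> None:
            raise TimeoutError("db write timed out")

    db = FlakyDB()

    def save_profile_quiet(user_id: str, profile: dict) -> None:
        # "Handled" by hiding: the caller believes the profile was saved, and the
        # bug resurfaces much later as mysteriously stale data.
        try:
            db.write(user_id, profile)
        except Exception:
            pass                                 # swallowed: the system is now "almost correct"

    def save_profile_loud(user_id: str, profile: dict) -> None:
        # Failing loud: the error appears at the moment and place it happened,
        # where retries, alerts, and humans can actually see it.
        try:
            db.write(user_id, profile)
        except Exception as exc:
            raise RuntimeError(f"profile write failed for {user_id}") from exc

    save_profile_quiet("u1", {"name": "Ada"})    # looks fine, silently did nothing
    save_profile_loud("u1", {"name": "Ada"})     # raises immediately and loudly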

Almost correct systems are quieter and more dangerous:

  • They pass unit tests.
  • They survive staging.
  • They fail only under specific load.
  • They fail only when timing is unlucky.

These failures are hard to reproduce because no single line of code is wrong. This is why postmortems often sound like: “Everything behaved as designed… just not together.”

A systems thinking checklist

When designing or reviewing a system, don’t start with implementation details. Start with failure questions to force assumptions into the open:

  • Retries: What retries this, and what is the retry budget?
  • Timeouts: Who times out first? Does the work stop when they do?
  • Idempotency: What happens if this exact request runs twice?
  • Partial Failure: What happens if the DB update succeeds but the cache update fails? (See the sketch after this list.)
  • State: What state survives a crash, and what assumption does the next run make about that state?
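
To make the partial-failure question concrete, here is one common answer in sketch form (hypothetical names, in-memory dictionaries standing in for the real stores): write the source of truth first, then invalidate the cache, and treat a failed invalidation as a visible event rather than a silent one.

    # Tiny in-memory stand-ins for the two stores (purely illustrative).
    db: dict[str, dict] = {}
    cache: dict[str, dict] = {}

    class CacheUnavailable(Exception):
        """Simulates the cache being unreachable mid-request."""

    def cache_delete(user_id: str, *, fail: bool = False) -> None:
        if fail:
            raise CacheUnavailable("cache node unreachable")
        cache.pop(user_id, None)

    def update_user(user_id: str, data: dict, *, cache_fails: bool = False) -> None:
        # Step 1: write the source of truth first.
        db[user_id] = data
        # Step 2: invalidate (don't update) the cache. If this step fails, the worst
        # case is a stale read until the entry's TTL expires, which is why cached
        # entries need a TTL at all.
        try:
            cache_delete(user_id, fail=cache_fails)
        except CacheUnavailable:
            # DB and cache now disagree. Surface it (or queue the key for a later
            # invalidation sweep); don't swallow it and pretend both writes landed.
            raise

    cache["u1"] = {"plan": "free"}               # a stale copy is already cached
    try:
        update_user("u1", {"plan": "pro"}, cache_fails=True)
    except CacheUnavailable:
        print("DB says 'pro', the cache still says 'free'; the next read must expect this")

Write-then-invalidate is not the only answer (you could queue the invalidation, or version the cached entries); the point of the checklist question is that you pick an answer on purpose instead of discovering one in production.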

Closing Thought

Great software isn’t built by eliminating bugs. It’s built by eliminating surprises. These surprises don’t come from bad code; they come from assumptions that were never made explicit.


Citations & Further Reading