Rethinking Availability and Fault Tolerance

Rethinking Reliability

We spend a lot of time trying to make things never fail.
But maybe that’s not the right goal.

In many cases, recovering from failure is more important than avoiding it.

Failure happens.
What matters more is how painful it is when it does.

No matter how much redundancy or automation you add, something will break.
The real question is:

How painful is it when it does?

A short, well-understood outage is often better than a mysterious, degraded state that nobody can debug.

If your team can detect, understand, and recover quickly, that’s real reliability.
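One way to make "detect, understand, and recover quickly" concrete is to time it. The sketch below is only an illustration: the `Incident` record, its field names, and the sample timestamps are all made up for this example, and a real team would pull these values from its own incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for this sketch.
    started: datetime    # when the failure actually began
    detected: datetime   # when an alert or a person noticed it
    resolved: datetime   # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_recover(self) -> timedelta:
        # Measured from the start of the failure, so slow detection
        # counts against recovery time as well.
        return self.resolved - self.started


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Invented sample data, purely for illustration.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0),
             datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 35)),
    Incident(datetime(2024, 2, 1, 9, 0),
             datetime(2024, 2, 1, 9, 15),
             datetime(2024, 2, 1, 10, 25)),
]

mean_time_to_detect = mean_duration([i.time_to_detect for i in incidents])
mean_time_to_recover = mean_duration([i.time_to_recover for i in incidents])
```

Tracking these two numbers over time shows whether the team is getting faster at noticing and fixing problems, which is a different question from how often problems occur.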

Too often, observability setups look great in demos but don’t match how teams actually operate.

Instead of tracking everything, design observability around how you run and what matters to your business.

If your business outcome depends on:

…then measure that.

Not just CPU or memory.
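As a rough sketch of what "measure that" could look like in code, the check below alerts on the share of user actions that succeed rather than on a resource gauge. Everything here is assumed for illustration: the `OutcomeWindow` type, the 99% target, and the idea that the outcome can be counted as attempts and successes. Substitute whatever outcome your business actually depends on.

```python
from dataclasses import dataclass


@dataclass
class OutcomeWindow:
    # Hypothetical counts for one time window (for example, five minutes).
    attempts: int    # user actions attempted in the window
    successes: int   # how many of them completed


def success_rate(window: OutcomeWindow) -> float:
    if window.attempts == 0:
        # Assumption: an empty window is treated as healthy here.
        # In practice, zero traffic may itself deserve a separate alert.
        return 1.0
    return window.successes / window.attempts


def should_alert(window: OutcomeWindow, target: float = 0.99) -> bool:
    """Alert when the outcome users care about drops below the target."""
    return success_rate(window) < target


degraded = OutcomeWindow(attempts=1000, successes=970)
healthy = OutcomeWindow(attempts=1000, successes=995)
```

With these invented numbers, `should_alert(degraded)` is true and `should_alert(healthy)` is false, regardless of what CPU or memory were doing at the time.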

When your alerts align with what your users and business care about, your system becomes:

Maybe the goal isn’t “don’t fail.”

Maybe it’s:

Perfect systems don’t exist.
But teams that recover well are the ones that win in the long run.

Before you chase another layer of redundancy, ask yourself:

Are we really building for availability, or are we building for recoverability?
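A bit of arithmetic helps frame that question. The standard steady-state formula is availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to recover. The sketch below uses invented numbers to show that halving recovery time buys the same availability as doubling the time between failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Invented baseline: one failure per 1000 hours, one hour to recover.
baseline = availability(mtbf_hours=1000, mttr_hours=1)

# Option A: fail half as often (for example, more redundancy).
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1)

# Option B: recover twice as fast (for example, better detection and runbooks).
faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)
```

Both options land at the same availability, up to floating-point rounding, so the choice comes down to which one is cheaper and more realistic for your team to achieve.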

If you focus on recovery and align observability with business outcomes, your system won’t just survive failure.

It’ll get stronger from it.

In our industry, we live with trade-offs.
Every decision gives us something and takes something away.

Choose wisely.