Every production system eventually faces the same uncomfortable truth: things fail. Networks partition, services crash, databases time out, and disk space runs out at 3 AM. The difference between a system that survives these events and one that doesn't comes down to one word: resilience.

Distributed systems are interconnected webs of dependencies — one failure can cascade through the entire network.

Why Resilience Matters

Imagine your application calls an external payment service. Under normal conditions, it responds in 200ms. But what happens when that service starts taking 30 seconds? Without resilience patterns, your application will:

  • Block threads waiting for responses that never arrive
  • Consume connection pool resources until nothing is left
  • Cascade failures to every service that depends on yours

This is exactly what happened in major outages at Amazon, Netflix, and countless other companies. The solution isn't to prevent failures — that's impossible. The solution is to design for failure.

Pattern 1: Retry with Exponential Backoff

The simplest resilience pattern is the retry. A transient network glitch shouldn't bring down your system. But naive retries are dangerous — if a service is overloaded, hammering it with retries makes things worse.

The solution is exponential backoff with jitter:

import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries — re-raise with the original traceback
            # Exponential backoff (capped at 30s) with jitter to avoid thundering herd
            delay = min(base_delay * (2 ** attempt), 30.0)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)

Why jitter? Without it, all retrying clients hit the recovering service at exactly the same moment — a phenomenon called the thundering herd problem. Adding random jitter spreads the load.
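To see the helper in action, here's a self-contained sketch (the function definition is repeated so the snippet runs standalone; `flaky_fetch` is a made-up stand-in for a real network call):

```python
import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), 30.0)
            time.sleep(delay + random.uniform(0, delay * 0.5))

# A stand-in for a real network call: fails twice, then succeeds
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient glitch")
    return "payload"

# Tiny base_delay so the demo runs quickly
result = retry_with_backoff(flaky_fetch, max_retries=5, base_delay=0.01)
print(result, "after", attempts["count"], "attempts")  # payload after 3 attempts
```

Because of the jitter, the actual sleep between attempts is randomized — two clients retrying the same failure won't wake up in lockstep.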

Pattern 2: The Circuit Breaker

A circuit breaker is like an electrical fuse: when failures exceed a threshold, the breaker trips and stops requests from reaching the failing service. Instead, it returns immediately with an error or fallback response.

Circuit breakers prevent cascading failures by stopping requests to degraded services.

A circuit breaker has three states:

  • Closed: Normal operation. Requests flow through. Failures are counted.
  • Open: The failure threshold has been exceeded. All requests are rejected immediately. A timer starts.
  • Half-Open: After the timer expires, a limited number of test requests are allowed through. If they succeed, the breaker closes. If they fail, it opens again.

Here's a JavaScript implementation:

class CircuitBreaker {
  constructor(requestFn, options = {}) {
    this.requestFn = requestFn;
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 60000; // 60 seconds
    this.halfOpenMaxRequests = options.halfOpenMaxRequests || 3;
    
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.lastFailureTime = null;
    this.halfOpenSuccesses = 0;
  }

  async execute(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime >= this.resetTimeout) {
        this.state = 'HALF_OPEN';
        this.halfOpenSuccesses = 0;
      } else {
        throw new Error('Circuit breaker is OPEN - request blocked');
      }
    }

    try {
      const result = await this.requestFn(...args);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    if (this.state === 'HALF_OPEN') {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.halfOpenMaxRequests) {
        this.state = 'CLOSED';
        this.failureCount = 0;
      }
    } else {
      this.failureCount = 0;
    }
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A single failure while HALF_OPEN re-opens the breaker immediately;
    // otherwise, open once the failure threshold is exceeded
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

Pattern 3: Timeouts Are Not Optional

A timeout is your first line of defence, yet it's shocking how many production services use default timeouts of 30+ seconds or — worse — no timeout at all. The rule is simple: every network call must have a timeout, and that timeout should be based on your SLA, not your gut feeling.

import httpx

# BAD: timeout disabled entirely (blocks forever on hung connections)
response = httpx.get("https://api.example.com/data", timeout=None)

# GOOD: explicit, SLA-aligned timeout
response = httpx.get(
    "https://api.example.com/data",
    timeout=httpx.Timeout(5.0, connect=3.0)
)

Set your timeouts based on the 99th percentile of your observed response times, plus a reasonable buffer. If your API normally responds in 200ms, a 5-second timeout gives plenty of room for occasional slow responses without letting your system hang indefinitely.

Pattern 4: Bulkhead Isolation

The bulkhead pattern borrows from ship design: compartments are isolated so that if one floods, the ship doesn't sink. In software, this means isolating different resource pools so that one failing dependency can't consume all your resources.

import axios from 'axios';
import https from 'https';

// Create separate connection pools for different services.
// axios itself has no pool-size option; pass a Node Agent with
// maxSockets to cap concurrent connections per client.
const paymentClient = axios.create({
  baseURL: 'https://payments.example.com',
  timeout: 5000,
  httpsAgent: new https.Agent({ maxSockets: 10 }),  // isolated pool
});

const notificationClient = axios.create({
  baseURL: 'https://notifications.example.com',
  timeout: 3000,
  httpsAgent: new https.Agent({ maxSockets: 5 }),   // separate isolated pool
});

If the payment service starts responding slowly, it can only consume its own 10 connections — the notification service's 5 connections remain untouched.

Putting It All Together

These patterns work best in combination. Here's how they layer together:

  1. Timeout: The first gate — don't wait forever for any single request.
  2. Retry with backoff: For transient failures, try again — but with increasing delay and randomness.
  3. Circuit breaker: If retries keep failing, stop trying altogether and give the downstream service space to recover.
  4. Bulkhead: Even if one service's circuit breaker is open, isolate the damage so other services keep working.

This layered approach is what companies like Netflix (with Hystrix), AWS, and Stripe use to keep their systems running at scale. You don't need to be Netflix to benefit — even a simple circuit breaker around your most critical external dependency can prevent cascading outages.

Key Takeaways

  • Expect failure — design your system assuming dependencies will fail, not hoping they won't.
  • Always set timeouts — a slow response is often worse than no response.
  • Add jitter to retries — synchronized retries create thundering herds.
  • Use circuit breakers — stop hammering failing services and give them room to recover.
  • Isolate resources with bulkheads — one failing service shouldn't take down everything else.

Resilience isn't about preventing every failure — it's about ensuring your system degrades gracefully and recovers quickly. Start with timeouts, add retries with backoff, then layer on circuit breakers and bulkheads where it matters most. Your future self, debugging a 3 AM incident, will thank you.