2. Scaling & Availability
A load balancer routes traffic. But who decides how many servers there are? And what happens when an entire data center goes down?
This is the domain of auto scaling and high availability: two concepts that work hand-in-hand with load balancing to keep your system running no matter what fails.
Picture a restaurant: the load balancer is the host seating guests at tables. Auto scaling is the manager who calls in extra staff when it gets busy. High availability is having a second kitchen so service continues even if one catches fire.
Why Scale?
A single server has hard limits: CPU cores, memory, network bandwidth, disk I/O. When traffic exceeds what one machine can handle, you have two options:
Vertical scaling (a bigger machine): more CPU, more RAM. Simple, but it has a ceiling: you can't buy an infinitely large server, and you still have a single point of failure.
Horizontal scaling (more machines): distribute the load. No ceiling, just add more. This is what the cloud is built for.
The diagram shows a single server at 95% CPU. One slow query, one traffic spike, and it's down. This is why production systems at scale favor horizontal scaling.
A grocery store is a good analogy here: vertical scaling is making one checkout lane faster. Horizontal scaling is opening more lanes.
Scale Out & Scale In
Scale out adds instances when demand increases. Scale in removes them when demand drops, so you're not paying for idle servers at 3 AM.
As traffic ramps up in the diagram, a 5th VM spins up and immediately joins the load balancer's rotation.
Scale out when:
• Average CPU across instances > 70%
• Request rate exceeds X req/sec per instance
• Queue depth grows beyond threshold
Scale in when:
• CPU drops below 30% for sustained period
• Request rate is well within capacity
Cooldown period: After a scaling action, the system waits (typically 60-300s) before scaling again. This prevents thrashing: rapidly adding and removing instances as metrics oscillate around the threshold.
A thermostat works the same way. It doesn't turn the AC on and off every second; it waits to see if the temperature stabilizes.
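The scale-out/scale-in rules plus cooldown fit in a few lines of Python. This is a minimal sketch, not a prescribed policy: the thresholds and the 120-second cooldown are the example values from above, and the function name and instance bounds are illustrative.

```python
# Example thresholds from the text: scale out above 70% CPU, in below 30%.
SCALE_OUT_CPU = 0.70
SCALE_IN_CPU = 0.30
COOLDOWN_SECONDS = 120  # wait between actions to avoid thrashing

def decide(avg_cpu, instances, last_action_at, now,
           min_instances=2, max_instances=10):
    """Return the new desired instance count (unchanged if no action)."""
    if now - last_action_at < COOLDOWN_SECONDS:
        return instances  # still cooling down: ignore metric oscillations
    if avg_cpu > SCALE_OUT_CPU and instances < max_instances:
        return instances + 1
    if avg_cpu < SCALE_IN_CPU and instances > min_instances:
        return instances - 1
    return instances
```

Note the cooldown check comes first: even a dramatic spike is ignored until the previous action has had time to take effect, which is exactly the thermostat behavior described above.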
Auto Scaling in Practice
Manual scaling means someone watches dashboards and clicks buttons at 2 AM. Auto scaling replaces that human with a policy engine that reacts in seconds.
Scaling policies:
In GCP, the auto scaler is built into Managed Instance Groups (MIG). You set a target CPU utilization or custom Cloud Monitoring metric, and it handles the rest. In AWS, it's an Auto Scaling Group (ASG) with target tracking or step scaling policies attached.
The auto scaler and load balancer work as a pair: the LB routes traffic to healthy instances, the auto scaler ensures there are enough of them.
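Target tracking, the policy style both GCP's autoscaler and AWS ASGs support, is essentially proportional control: size the fleet so the per-instance metric lands back on the target. A minimal sketch of that math (the function name and bounds are illustrative, not an actual cloud API):

```python
import math

def desired_replicas(current, metric, target, min_n=1, max_n=20):
    """Proportional sizing: if instances run at `metric` and the goal is
    `target`, scale the fleet by metric/target, rounding up."""
    desired = math.ceil(current * metric / target)
    return max(min_n, min(max_n, desired))
```

For example, 4 instances at 90% CPU with a 60% target yield a desired size of 6: spreading the same load over 6 instances brings each back to roughly 60%.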
Health Checks: The Glue
Health checks are the feedback loop that makes everything work. Without them, the LB doesn't know which servers are alive, and the auto scaler doesn't know when to replace failed instances.
How they work:
The load balancer sends GET /health to each instance every 10-30 seconds.
• Healthy: returns 200 and stays in rotation.
• Unhealthy: returns 5xx or times out for N consecutive checks and is removed from rotation.
What a good health check tests:
• App process is running
• Database connection is alive
• Critical dependencies are reachable
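The "N consecutive checks" rule is what keeps a single dropped packet from ejecting a healthy instance. A minimal tracker might look like this (the class name and threshold of 3 are illustrative):

```python
FAILURE_THRESHOLD = 3  # consecutive failed checks before removal

class HealthTracker:
    def __init__(self):
        self.failures = {}      # instance -> consecutive failure count
        self.in_rotation = set()

    def record(self, instance, status):
        """status: HTTP code from GET /health, or None on timeout."""
        if status == 200:
            self.failures[instance] = 0      # any success resets the streak
            self.in_rotation.add(instance)
        else:
            self.failures[instance] = self.failures.get(instance, 0) + 1
            if self.failures[instance] >= FAILURE_THRESHOLD:
                self.in_rotation.discard(instance)  # pulled from rotation
```

An instance that fails twice and then recovers stays in rotation; only a sustained streak of failures removes it.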
In GCP, health checks are a first-class resource. You create them independently and attach to both the load balancer and the MIG. The MIG uses them for auto-healing: if an instance fails health checks, it's automatically recreated from the instance template.
It's similar to hospital vitals monitoring: if the heartbeat flatlines, the system responds automatically.
High Availability: The Nines
High availability (HA) is the ability of a system to remain operational despite failures. It's measured in “nines” of uptime:
• 99% (two nines): up to ~3.65 days of downtime per year
• 99.9% (three nines): ~8.76 hours per year
• 99.99% (four nines): ~52.6 minutes per year
• 99.999% (five nines): ~5.26 minutes per year
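Each extra nine cuts the annual downtime budget by a factor of ten; the arithmetic is simple enough to check directly:

```python
def downtime_hours_per_year(availability):
    """Allowed downtime, in hours per year, at a given availability level."""
    return (1 - availability) * 365 * 24

# 99.9% ("three nines") leaves roughly 8.76 hours of downtime per year;
# 99.99% leaves roughly 52.6 minutes.
```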
The key insight: you can't achieve HA with a single location. Hardware fails. Power goes out. Networks partition. The only way to survive is redundancy across failure domains.
In the diagram, VMs are spread across Zone A and Zone B. Each zone is a separate physical data center with independent power, cooling, and networking. If one zone fails, the other keeps serving.
Having two offices in different buildings works the same way: if one loses power, the other keeps running.
Zone Failover in Action
When Zone A goes down, the load balancer detects failed health checks and shifts all traffic to Zone B. No human intervention. No downtime.
The failover sequence:
• Health checks to Zone A instances start failing.
• After N consecutive failures, the load balancer pulls Zone A out of rotation and sends all traffic to Zone B.
• The auto scaler adds instances in Zone B to absorb the extra load.
• When Zone A recovers, auto-healing recreates its instances and traffic rebalances.
This is the trifecta at work: health checks detect the failure, load balancing reroutes traffic, and auto scaling restores capacity.
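To make the rerouting concrete, here is a toy round-robin balancer that simply skips zones whose health checks have failed (the class and method names are invented for illustration):

```python
import itertools

class LoadBalancer:
    """Round-robin across zones, skipping any zone marked unhealthy."""
    def __init__(self, zones):
        self.backends = {z: True for z in zones}  # zone -> healthy?

    def mark_unhealthy(self, zone):
        self.backends[zone] = False

    def route(self, n_requests):
        healthy = [z for z, ok in self.backends.items() if ok]
        rr = itertools.cycle(healthy)
        return [next(rr) for _ in range(n_requests)]
```

Once `mark_unhealthy("zone-a")` is called, every subsequent request lands in the remaining zone, with no change needed from clients.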
In GCP, a regional MIG automatically spreads instances across zones. The HTTPS Load Balancer is global, so it can route around zone, region, or even continent-level failures. In AWS, you configure a multi-AZ Auto Scaling Group with an ALB.
GCP Load Balancer Types
GCP has a family of load balancers, each optimized for different traffic patterns:
External HTTPS Load Balancer: global. Routes by URL path, host, headers. Terminates SSL. Integrates with Cloud CDN and Cloud Armor (WAF). This is the default for web apps.
SSL/TCP Proxy Load Balancer: global. For non-HTTP traffic: databases, game servers, custom protocols. Terminates SSL without inspecting HTTP.
Internal HTTP(S) Load Balancer: regional. For service-to-service traffic within your VPC. Not exposed to the internet. L7 routing for internal microservices.
Network Load Balancer: regional. L4 passthrough: doesn't terminate connections. Lowest latency. Used for UDP, non-HTTP TCP, or when you need to preserve the client IP.
The AWS equivalents: ALB (L7), NLB (L4 passthrough), GWLB (inline appliances). AWS has no single global load balancer; the usual pattern is CloudFront (or Global Accelerator) in front of per-region ALBs.
Managed Instance Groups (MIG)
A Managed Instance Group is GCP's way of combining everything we've discussed into one managed resource.
What a MIG provides:
• Instance template: machine type, disk, startup script, container image
• Auto scaling: target CPU, custom metrics, schedules
• Auto healing: recreates instances that fail health checks
• Rolling updates: deploy new versions with zero downtime
• Multi-zone distribution: spreads instances across zones automatically
A regional MIG in us-central1 might run instances across us-central1-a, -b, and -c. If us-central1-a has a hardware failure, the MIG redistributes instances to the remaining zones.
The AWS equivalent is an Auto Scaling Group (ASG) with a launch template, multi-AZ placement, and instance refresh for rolling deployments.
A MIG is essentially a factory floor manager: it follows a blueprint (template), keeps the right number of workers on shift (auto scaling), replaces anyone who calls in sick (auto healing), and spreads them across buildings (multi-zone).
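The control loop inside a MIG (or ASG) boils down to reconciling "instances I have" against "instances I want, spread across zones." A simplified sketch of that placement decision, assuming the group fills the least-populated zone first (function and parameter names are illustrative):

```python
from collections import Counter

def plan(target_size, zones, running):
    """running: {zone: count of healthy instances}. Returns {zone: count
    to create} so the group reaches target_size, always adding to the
    least-loaded zone; ties go to the earliest zone in the list."""
    counts = Counter({z: running.get(z, 0) for z in zones})
    creates = Counter()
    total = sum(counts.values())
    while total < target_size:
        z = min(zones, key=lambda z: counts[z] + creates[z])
        creates[z] += 1
        total += 1
    return dict(creates)
```

This same loop covers auto-healing: an instance that fails health checks is dropped from `running`, so the next reconciliation pass schedules a replacement.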
The Full Architecture
Here's how it all fits together in a production GCP deployment:
Client
→ Cloud DNS
→ External HTTPS LB (global)
→ Cloud CDN (edge cache)
→ Cloud Armor (WAF/DDoS)
→ Regional MIG (multi-zone)
→ VM instances (auto-scaled)
→ Health checks (auto-healing)
→ Cloud SQL / Memorystore
How they work together:
Load balancing, auto scaling, and high availability aren't separate concerns. They're a single system. The LB needs the auto scaler to ensure capacity. The auto scaler needs health checks to know what's broken. And multi-zone deployment makes all of it resilient.