
Scaling Web Apps in 2026: Vertical vs Horizontal, Honestly

When to scale up (more CPU/RAM on one container) vs scale out (more containers behind a load balancer). What actually breaks at each tier and the order to scale in.

Most discussions of scaling skip straight to Kubernetes and autoscaling groups. The reality for the vast majority of applications is more boring and much cheaper: scale vertically until it hurts, then scale horizontally only the components that demonstrably need it.

This guide is a practical breakdown of what each scaling axis gets you, when to switch, and what tends to break in the wild.

The two axes

| Axis | What you do | Example |
| --- | --- | --- |
| Vertical (scale up) | Same number of containers, more CPU / RAM per container | 1 vCPU + 1 GB → 4 vCPU + 8 GB |
| Horizontal (scale out) | More containers (replicas) behind a load balancer | 1 replica → 5 replicas |

In practice, you'll do both — but in a specific order.

Phase 1: 0 → first paying customers

What you do: Run on the smallest tier. Don't optimise yet. You're optimising for time-to-validate-the-product, not for cost-per-request.

Indicative shape: 1 container, 1 vCPU, 512 MB – 1 GB RAM. One Postgres database (smallest tier). Maybe Redis if your stack genuinely needs it.

What breaks first: Almost always memory. Node and Python applications tend to grow their RAM footprint with traffic, and OOM kills surface as unexplained 502s in your logs.

The fix at this tier: Scale vertically. Bump container memory. The cost delta is trivial; the engineering time saved is enormous.

Phase 2: 100s of users / day

What you do: Real traffic arrives. Latency starts mattering. You'll discover at least one slow query and one hot endpoint.

Indicative shape: 1 container, 2 vCPU, 1–2 GB RAM. Postgres bumped to a tier with SSD and ~100 connections. Redis if not already present.

What breaks first: Slow database queries. The fix is index work, not horizontal scaling. Most apps die in production because of one missing index, not insufficient compute.

The fix at this tier: Profile in production. Launchverse's Observability tab gives you the deploy success rate and resource utilisation; combine it with slow-query logging in Postgres (SET log_min_duration_statement = '100ms'; logs every statement slower than 100 ms). Then add the missing indexes. A single index on a hot query path often buys a ~10× capacity gain on the same container.
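
Why does one index matter so much? A toy model in Python makes the shape of the problem visible: an unindexed lookup is a full scan over every row, while an index answers from a keyed structure. (The dict here is an illustrative stand-in for a Postgres b-tree, not how Postgres is implemented.)

```python
# Unindexed lookup: a full scan touches every row, O(n) per query.
def find_by_email_scan(rows, email):
    return [r for r in rows if r["email"] == email]

# "Indexed" lookup: a dict keyed on the column answers in O(1),
# playing the role a b-tree index plays inside Postgres.
def build_email_index(rows):
    index = {}
    for r in rows:
        index.setdefault(r["email"], []).append(r)
    return index

rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]
index = build_email_index(rows)

# Same answer, wildly different cost per query at scale.
assert find_by_email_scan(rows, "user42@example.com") == index["user42@example.com"]
```

At 100,000 rows the scan does 100,000 comparisons per request; the index does roughly one. That gap, multiplied by requests per second, is why index work beats adding containers at this phase.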

Phase 3: 10,000s of users / day

What you do: Traffic spikes are real. You're paged at 03:00 because a viral mention caused an outage. You start thinking about HA.

Indicative shape: 2–4 containers behind a load balancer, 2–4 vCPU + 2–4 GB each. Postgres on a serious tier with read replicas if reads dominate. Redis sized for connection count and working set.

What breaks first: Stateful assumptions. Until now you may have stored sessions in memory, cached computed values in process, or written user uploads to local disk. Every one of those breaks the moment you have multiple containers. Sessions go to Redis; cache goes to Redis or a CDN; uploads go to S3 / R2 / B2.

The fix at this tier: Horizontal scaling. Launchverse can run multiple replicas of a project; configure the desired replica count in the project settings. Health checks gate traffic to healthy replicas.
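
What a health check actually does is simple: report unhealthy when a critical dependency is down, so the load balancer stops routing to that replica. A minimal sketch, with `check_db` and `check_cache` as hypothetical stand-ins for real connection pings:

```python
# Minimal /healthz handler sketch: return 200 only when the replica's
# critical dependencies respond, 503 otherwise. The load balancer
# gates traffic on this status.

def healthz(check_db, check_cache):
    checks = {"db": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return status, checks

# Healthy replica: stays in rotation.
status, _ = healthz(lambda: True, lambda: True)
assert status == 200

# Database unreachable: the LB pulls this replica until it recovers.
status, _ = healthz(lambda: False, lambda: True)
assert status == 503
```

One design note: keep the check cheap and honest. A health check that only returns a static 200 hides dead replicas; one that does expensive work becomes its own load source.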

Phase 4: 100,000s of users / day

What you do: You start specialising. Different parts of the system have different load profiles, so scaling them together stops making sense.

Indicative shape: Web app on N replicas. Background workers on M replicas (different scaling triggers). Database on a serious primary + read replicas + (possibly) a separate analytics replica. Cache cluster.

What breaks first: The database. Every other layer can be scaled cheaply; the database can't. Read replicas for read-heavy workloads, partitioning for write-heavy, CQRS where the contention is structural.

The fix at this tier: It's no longer a generic "scale" problem; it's an architecture problem. The work shifts from configuration to code — pulling out hot paths, denormalising for read performance, moving heavy reads to replicas.
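
One of those code-level moves, routing reads to replicas, can be sketched in a few lines. This is an illustrative splitter, not a production driver: the connection objects here are plain labels standing in for pooled database connections.

```python
import itertools

# Sketch of a read/write splitter: writes go to the primary,
# reads fan out across replicas round-robin.
class Router:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self._replicas)
        return self.primary  # INSERT / UPDATE / DELETE / DDL stay on the primary

router = Router("primary", ["replica-1", "replica-2"])
assert router.route("INSERT INTO users ...") == "primary"
assert router.route("SELECT * FROM users") == "replica-1"
assert router.route("select count(*) from orders") == "replica-2"
```

The real-world wrinkle this sketch ignores is replication lag: a read issued immediately after a write may need to go to the primary anyway, or your user writes a comment and then can't see it. That's exactly the kind of architecture work this phase demands.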

When NOT to scale horizontally

A surprising amount of startup engineering effort is wasted scaling horizontally too early:

  • Sessions stored in memory. Going from 1 → 2 replicas without Redis sessions means users get logged out randomly. Either fix session storage first, or stay at 1 replica.
  • Uploads to local disk. Same problem; uploads vanish when the user's next request lands on a different replica.
  • Caches in process. Computed results in Map or dict don't share across replicas. Either move to Redis or accept duplicate work.
  • Cron jobs that run "once." With multiple replicas, the platform's cron task runs N times — once per replica — unless your cron tasks are project-scoped (which Launchverse's cron jobs are; they run once on a single container).

The pattern: make the application stateless first, then scale horizontally. Doing it in the other order causes data loss.
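
The session failure mode is easy to simulate. Below, two dicts play the role of two replicas' in-process memory, and a third plays the role of a shared store like Redis; the session ID and user are invented for the example:

```python
# Two replicas, each with its own in-process session storage.
replica_a_sessions = {}
replica_b_sessions = {}
shared_sessions = {}  # stand-in for Redis: one store all replicas see

# Login request lands on replica A.
replica_a_sessions["sess-123"] = {"user": "ada"}
shared_sessions["sess-123"] = {"user": "ada"}

# The load balancer sends the user's next request to replica B.
assert "sess-123" not in replica_b_sessions          # in-memory: "randomly logged out"
assert shared_sessions["sess-123"]["user"] == "ada"  # shared store: session survives
```

The same diagram applies to local-disk uploads and in-process caches: any state a second replica can't see is state you'll lose the day you scale out.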

Vertical scaling has a ceiling

There's a practical cap on how big a single container can usefully get. Beyond ~16 vCPU / 32 GB, scaling vertically buys you almost nothing — your application can't use that compute single-process. It's not a hard rule, but it's a useful sanity check: if you're proposing a 64 vCPU container for a Node app, you're holding it wrong.
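
A rough model of why: single-threaded runtimes like Node or CPython use at most one core per worker process, so the usable portion of a big container is capped by how many workers you run, not by how many vCPUs you buy.

```python
# Back-of-envelope check on the vertical ceiling: for a runtime that
# is single-threaded per process, usable vCPUs ≈ min(vCPUs, workers).
# This ignores I/O threads and GC, so treat it as a sanity check only.

def usable_vcpus(vcpus, workers):
    return min(vcpus, workers)

# One Node process on a 64 vCPU container leaves ~63 cores idle.
assert usable_vcpus(64, 1) == 1
# Four cluster workers on 4 vCPU is fully utilised.
assert usable_vcpus(4, 4) == 4
```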

Auto-scaling

Auto-scaling is "the platform adds replicas when CPU is high; removes them when CPU is low." It's seductive and frequently misconfigured. Two warnings:

  1. Scale-down windows must be slow. A flap (scale up, scale down, scale up, scale down) costs you cold-start latency on every cycle. Bias toward retaining replicas; cost is cheap, latency is not.
  2. Cold-start cost varies by stack. A Go binary cold-starts in a few hundred ms. A Java app might take 30 seconds. Don't enable aggressive auto-scaling for stacks with slow cold starts; you'll trade peak-load capacity for latency outliers.

A pragmatic auto-scaling policy:

| Setting | Recommended |
| --- | --- |
| Min replicas | 2 (so a single container failure doesn't take you down) |
| Max replicas | 10× your steady-state count |
| Scale-up trigger | CPU > 70% for 60 s |
| Scale-down trigger | CPU < 30% for 300 s (5 minutes; deliberately slow) |
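
As code, that policy is a small hysteresis rule. The sketch below is illustrative (the thresholds mirror the table; `cpu_history` as a list of `(age_seconds, cpu_percent)` samples is an invented input shape, and max replicas assumes a steady state of 2):

```python
# Hysteresis autoscaling sketch: scale up fast on sustained high CPU,
# scale down slowly, clamp to [MIN_REPLICAS, MAX_REPLICAS].
MIN_REPLICAS, MAX_REPLICAS = 2, 20
UP_CPU, UP_WINDOW = 70, 60        # CPU > 70% for 60 s
DOWN_CPU, DOWN_WINDOW = 30, 300   # CPU < 30% for 300 s (slow on purpose)

def desired_replicas(current, cpu_history):
    """cpu_history: list of (age_seconds, cpu_percent) samples."""
    recent_up = [cpu for age, cpu in cpu_history if age <= UP_WINDOW]
    recent_down = [cpu for age, cpu in cpu_history if age <= DOWN_WINDOW]
    if recent_up and min(recent_up) > UP_CPU:
        return min(current + 1, MAX_REPLICAS)
    if recent_down and max(recent_down) < DOWN_CPU:
        return max(current - 1, MIN_REPLICAS)
    return current

# 60 s of sustained high CPU: add a replica.
assert desired_replicas(2, [(0, 85), (30, 80), (60, 90)]) == 3
# A brief spike inside an otherwise-quiet 5 minutes: hold steady, no flap.
assert desired_replicas(3, [(0, 25), (60, 75), (200, 20)]) == 3
# Sustained low CPU: shrink, but never below the 2-replica floor.
assert desired_replicas(2, [(0, 10), (150, 15), (300, 12)]) == 2
```

Note the asymmetry doing the work: one hot minute triggers scale-up, but scale-down requires a full quiet five minutes. That asymmetry is the anti-flap protection warning #1 describes.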

Ready to deploy?

Start free in Naira — no card required, no FX surprises.

Have feedback or a topic to suggest? Talk to us.