How to Log 1M Events/Sec Without Slowing Down Your System


Every modern system—fintech, ad-tech, multiplayer games, e‑commerce, fantasy sports—relies on logs as breadcrumbs of truth. Individually, events look tiny; collectively, they become millions per second.

Logging feels trivial when traffic is small. At scale, it becomes one of the most expensive operations in your system.

Logging is not “just write a line.” It’s I/O, storage, latency, durability, retention, indexing, retrieval, and analytics. Once traffic crosses a threshold, naïve approaches fall apart.


Attempt 1: Log Directly Inside App Code

Most beginners start here:

log.info("User " + userId + " clicked Buy Now");

Why it feels right: simple and colocated with business logic.

Why it fails at scale:

  • Logging is blocking I/O.
  • CPU cycles compete with business logic.
  • Disk bandwidth bottlenecks if writing to disk.
  • Network latency kills throughput if sending externally.
  • Synchronous writes pause requests when I/O is slow.

At 1M events/sec, even 1 ms per log means:

1,000,000 events × 1 ms = 1,000,000 ms = 1,000 seconds of blocking time accumulated every second of traffic → impossible to absorb.

Put differently: at 1 ms per write, a single thread can log only ~1,000 events/sec, so keeping up would take ~1,000 threads doing nothing but logging.

Synchronous logging dangerously couples app performance with logging performance.


Attempt 2: Batch Logs and Write Periodically

buffer = []

def log_event(event):
    buffer.append(event)
    if len(buffer) >= 1000:      # flush once the batch fills
        write_to_disk(buffer)    # write_to_disk: stand-in for the real sink
        buffer.clear()

Why it feels right: amortizes I/O and reduces write frequency.

Why it still breaks:

  • Process crashes → buffered logs vanish.
  • Bursts exceeding buffer capacity → drops.
  • Periodic flushes require locking → thread contention.
  • Slow storage → buffers back up → memory explodes.

Batching helps but doesn’t decouple logging from request handling.


Attempt 3: Log to a Database

SQL/NoSQL is durable and queryable—sounds nice until:

  • Inserts become bottlenecks.
  • Locks and indexes kill write throughput.
  • Querying logs impacts app data.
  • Storage costs skyrocket.
  • Retention is a nightmare.
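
Concretely, "log to a database" usually means one INSERT per event. A hedged JDBC sketch (the logs table, its columns, and the connection setup are illustrative assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

final class DbLogger {
    private final Connection conn;

    DbLogger(Connection conn) {
        this.conn = conn;
    }

    // One network round trip, one transaction, and one index update
    // per event: the pattern that caps throughput long before 1M/sec.
    void log(String level, String message) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO logs (ts, level, message) VALUES (?, ?, ?)")) {
            ps.setTimestamp(1, Timestamp.from(Instant.now()));
            ps.setString(2, level);
            ps.setString(3, message);
            ps.executeUpdate();
        }
    }
}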

Typical write capacities (ballpark):

  • MySQL: ~10k writes/sec
  • MongoDB: ~50k–200k writes/sec
  • Cassandra: high write throughput but expensive at scale

1M/sec destroys this model, financially and operationally.

Databases are for structured data, not firehose-scale telemetry.


Attempt 4: Send Logs Directly to Elasticsearch/OpenSearch

Search feels perfect until indexing overhead hits:

  • Tokenization, sharding, replication, compression, persistence.

Large clusters struggle beyond ~200k–300k events/sec without massive hardware, heavy tuning, and constant maintenance.

Elasticsearch/OpenSearch is great for querying logs—not ingesting raw firehose directly from apps.


The Scalable Architecture

Logging must be asynchronous, streamed, and decoupled from the application.

1) Application Layer: Non-Blocking Logging

  • Async log appenders.
  • In-memory ring buffers.
  • Never block the request path.

Libraries: Logback AsyncAppender or Log4j2's Disruptor-based async loggers (Java), Zerolog (Go), Pino (Node.js), tracing (Rust).
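
The core idea, as a minimal Java sketch (the queue size, drop-on-overflow policy, and console output are illustrative assumptions; real appenders like those above add batching, error handling, and metrics):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class AsyncLogger {
    // Bounded buffer: request threads enqueue, one writer thread drains.
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(65_536);

    public AsyncLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    // Blocks only this background thread, never a request.
                    String event = buffer.take();
                    System.out.println(event); // stand-in for slow file/network I/O
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    public void log(String event) {
        // offer() never blocks; under overload we drop rather than stall requests.
        buffer.offer(event);
    }
}

Dropping under overload is a deliberate choice here: losing a debug line is cheaper than pausing a request.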

2) Queue or Stream: The Shock Absorber

  • Absorbs bursts, smooths throughput.
  • Producers write at memory speed.
  • Consumers scale independently.
  • Backpressure and replay/retention built-in.

Kafka with proper partitioning can handle millions of events/sec.
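
A minimal producer sketch using the Kafka Java client; the broker address, topic name, key, and tuning values are illustrative assumptions, not recommendations:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1");               // trade a little durability for latency
        props.put("linger.ms", "5");          // give the client time to batch sends
        props.put("compression.type", "lz4"); // shrink payloads on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String eventJson = "{\"user\":42,\"action\":\"buy_now\"}";
            // Keying by service (or user) keeps related events in one partition.
            producer.send(new ProducerRecord<>("app-logs", "checkout-svc", eventJson));
        }
    }
}

The send() call hands the record to a background I/O thread and returns immediately, which is exactly the decoupling the application layer needs.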

3) Storage Layer: Hot + Warm + Cold

  • Hot: recent, searchable, indexed for fast ops visibility.
  • Warm: aggregated analytics-friendly; reduced indexing and cost.
  • Cold: long-term retention in object storage; minimal indexing, cheapest.

Promote only what’s useful; don’t index everything.
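
One way to express that policy in code, as a hedged sketch: the Sink interface and the severity-based routing rule below are hypothetical, standing in for a search cluster (hot), an analytics store (warm), and object storage (cold):

// Hypothetical sinks; none of these are real library APIs.
interface Sink {
    void accept(String event);
}

final class TierRouter {
    private final Sink hotIndex, warmStore, coldArchive;

    TierRouter(Sink hot, Sink warm, Sink cold) {
        this.hotIndex = hot;
        this.warmStore = warm;
        this.coldArchive = cold;
    }

    void route(String event, boolean highValue) {
        coldArchive.accept(event);      // everything: cheap, durable, unindexed
        if (highValue) {
            hotIndex.accept(event);     // errors, audits, recent ops data: indexed
        } else {
            warmStore.accept(event);    // routine events: aggregated, lightly indexed
        }
    }
}

The key property: indexing cost is paid only for events you actually promote.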

4) Observability: Metrics Over Logs

Prefer counters, histograms, and traces for repetitive signals.
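
For example, with a metrics library such as Micrometer (named here as one option, not prescribed by this article), a repetitive event becomes a counter increment instead of a log line; the metric name and tag are illustrative:

import io.micrometer.core.instrument.Metrics;

public final class CheckoutMetrics {
    // One counter absorbs millions of identical "clicked Buy Now" events;
    // log lines are reserved for the anomalies worth reading.
    public static void recordBuyNowClick() {
        Metrics.counter("buy_now.clicks", "service", "checkout").increment();
    }
}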

Log what matters. Aggregate what repeats.


Why This Architecture Works

The app writes and moves on—no waiting, no blocking. As traffic grows, scale consumers and partitions. Storage becomes smart: fast where needed, cheap everywhere else. Failures don’t drop events; buffers and streams keep absorbing. Logs flow like a river, not a traffic jam.


Interview Answer

I wouldn’t log directly. I’d log asynchronously, push events into an in-memory buffer, stream them through Kafka, and store them in tiers: hot for recent searchable data, warm for analytics, and cold for long retention. That way logging never slows requests, stays durable, and scales without rewriting the system.


Potential Follow-Ups

  • How to avoid log duplication?
  • How to detect ingestion lag and apply backpressure?
  • What if Kafka goes down?
  • How to avoid logging sensitive/PII data?