A rate limiter is a controller that monitors and caps the amount of traffic allowed into a service, protecting your network and services from attacks and overload and ultimately keeping the system from crashing. In the context of HTTP, it restricts the number of requests a client can send within a specified time period. Once the threshold is exceeded, any additional requests are blocked.
What is Rate Limiting, Really?
Rate limiting means restricting how callers can access a service, in a controlled fashion, over a given timeframe. For example:
- A user can publish no more than 50 posts per minute.
- No more than 10 accounts per day can be created from the same IP address.
- Only 10 queries per second are allowed on a large eCommerce site.
Before diving into the design, let’s review the key benefits of implementing an API rate limiter:
- Prevents Denial of Service (DoS) attacks – Excessive requests, whether intentional or accidental, can overwhelm servers. A rate limiter protects systems by rejecting calls beyond the defined limit. Major companies enforce such limits. For example:
- Twitter restricts users to 300 tweets per 3 hours.
- Instagram limits the number of posts, likes, and comments to curb spam and bot-generated content.
- Reduces cost – Limiting requests saves infrastructure resources and ensures better allocation to high-priority APIs. This is especially critical when using paid third-party APIs that charge per call (e.g., credit checks, payment processing, or health record retrieval).
- Prevents server overload – By blocking excessive requests from bots or misbehaving clients, rate limiters help maintain system stability and consistent performance.
Why Rate Limiting Matters
In software systems, rate limiting plays the role of a gatekeeper. It’s the traffic cop that ensures clients don’t overload your APIs, services, or infrastructure. As system designers, we use rate limiters to prevent abuse, control load, and guarantee fairness. Without one, your service is like a roller coaster with no rules: one greedy client can ruin the ride for everyone else.
Why Should System Designers Care?
Because rate limiting is everywhere.
- APIs: Protecting backend services (think Twitter API or Stripe API).
- Authentication: Prevent brute-force login attacks.
- Messaging systems: Throttling producers so consumers aren’t overwhelmed.
- Databases: Preventing query floods that crash your DB.
- User experience: Keeping things fair (one user shouldn’t hog resources).
As system designers, ignoring rate limiting is like building a dam with no floodgates. You’re just waiting for the inevitable overflow.
Core Algorithms of Rate Limiting
There are multiple ways to implement rate limiting. Each has pros, cons, and trade-offs.
1. Fixed Window Counter
Simplest approach: Keep a counter per client for each time window.
- Example: Allow 100 requests per minute.
- If counter > 100 → block until next minute.
Pros: Easy to implement.
Cons: Bursty. A client could send 100 requests at 11:59:59 and another 100 at 12:00:01, effectively 200 requests in about two seconds.
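A minimal in-memory sketch of the fixed window counter, assuming a single process; the limit of 100 requests per 60-second window is illustrative:

```python
import time
from collections import defaultdict

LIMIT = 100          # max requests per window (illustrative)
WINDOW_SECONDS = 60  # window length

# counters[(client_id, window_index)] -> request count in that window
counters = defaultdict(int)

def allow_request(client_id: str) -> bool:
    """Return True if the client is still under the limit for the current window."""
    window_index = int(time.time()) // WINDOW_SECONDS
    counters[(client_id, window_index)] += 1
    return counters[(client_id, window_index)] <= LIMIT
```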
2. Sliding Window Log
Store timestamps of each request in a log. For each new request, clean up old timestamps outside the window and check the count.
Pros: Precise, no bursts.
Cons: Memory-heavy for large-scale systems (you’re storing lots of timestamps).
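A sketch of the log-based approach, again in-memory and single-process for illustration; each client keeps a deque of recent request timestamps:

```python
import time
from collections import defaultdict, deque

LIMIT = 100          # max requests per rolling window (illustrative)
WINDOW_SECONDS = 60

# logs[client_id] -> timestamps of requests still inside the window
logs = defaultdict(deque)

def allow_request(client_id: str) -> bool:
    """Evict timestamps older than the window, then check the remaining count."""
    now = time.time()
    log = logs[client_id]
    while log and log[0] <= now - WINDOW_SECONDS:
        log.popleft()
    if len(log) < LIMIT:
        log.append(now)
        return True
    return False
```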
3. Sliding Window Counter
Improved version: Instead of storing all timestamps, you split the time into buckets and interpolate.
Pros: Balances precision and memory.
Cons: Still adds some complexity.
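One common way to interpolate (the exact weighting here is an assumption, not a standard) is to keep one counter per fixed window and weight the previous window by how much of it still overlaps the rolling window:

```python
import time
from collections import defaultdict

LIMIT = 100
WINDOW_SECONDS = 60

# counts[(client_id, window_index)] -> requests in that fixed window
counts = defaultdict(int)

def allow_request(client_id: str) -> bool:
    """Approximate the rolling count from the current and previous buckets."""
    now = time.time()
    window = int(now) // WINDOW_SECONDS
    elapsed_fraction = (now % WINDOW_SECONDS) / WINDOW_SECONDS
    previous = counts[(client_id, window - 1)]
    current = counts[(client_id, window)]
    # Weight the previous bucket by the share of it still inside the rolling window.
    estimated = previous * (1 - elapsed_fraction) + current
    if estimated < LIMIT:
        counts[(client_id, window)] += 1
        return True
    return False
```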
4. Token Bucket
Imagine a bucket filled with tokens at a steady rate (say 5 tokens per second). Every request consumes a token. If the bucket is empty, requests are denied or queued.
Pros: Supports bursts (you can use saved-up tokens). Widely used.
Cons: Needs careful tuning of refill rate vs. bucket size.
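A minimal sketch of a token bucket; the capacity and refill rate are illustrative knobs:

```python
import time

class TokenBucket:
    """Refill tokens at a steady rate; each request spends one token."""

    def __init__(self, capacity: float = 10, refill_rate: float = 5.0):
        self.capacity = capacity        # bucket size = maximum burst
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        # Add the tokens earned since the last check, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```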
5. Leaky Bucket
Like a bucket with a small hole. Requests pour in, but they only leave (get processed) at a steady rate. If the bucket overflows, requests are dropped.
Pros: Smooths out traffic spikes.
Cons: Less flexible for bursty traffic than token bucket.
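A sketch of the leaky bucket treated as a meter: each request raises the water level, which drains at a fixed rate, and anything that would overflow is dropped (capacity and leak rate are illustrative):

```python
import time

class LeakyBucket:
    """Requests fill the bucket; the bucket drains at a constant rate."""

    def __init__(self, capacity: float = 10, leak_rate: float = 5.0):
        self.capacity = capacity    # how much can queue up before overflow
        self.leak_rate = leak_rate  # requests processed (leaked) per second
        self.level = 0.0
        self.last_check = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        # Drain the bucket according to how much time has passed.
        self.level = max(0.0, self.level - (now - self.last_check) * self.leak_rate)
        self.last_check = now
        if self.level < self.capacity:
            self.level += 1
            return True
        return False  # bucket would overflow: drop the request
```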
6. Distributed Rate Limiting Challenges
When you have multiple servers, how do you enforce limits globally?
- Use a centralized data store (like Redis) to keep counters/tokens (see the sketch after this list).
- Use consistent hashing to route clients to a specific node.
- Or apply local + global hybrid: local node limits, plus a central hard cap.
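Here is a sketch of the centralized-store option, assuming Redis and the redis-py client. A small Lua script keeps the read-modify-write atomic when many app servers share the same bucket; the key prefix, capacity, and refill rate are illustrative:

```python
import time
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

# Token bucket state lives in a Redis hash; Lua makes the update atomic.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + (now - ts) * refill_rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 3600)
return allowed
"""

token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(client_id: str, capacity: int = 100, refill_rate: float = 5.0) -> bool:
    key = f"rl:{client_id}"
    return token_bucket(keys=[key], args=[capacity, refill_rate, time.time()]) == 1
```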
Real-World Examples
- Twitter API: Limits the number of tweets, likes, and follows per day.
- ChatGPT: During the recent Ghibli-style image trend, OpenAI imposed per-user limits on image generation over a 24-hour window.
- Stripe API: Enforces per-second and per-minute caps.
- Nginx: Has a built-in `limit_req_zone` directive for rate limiting.
- AWS API Gateway: Provides configurable rate and burst limits.
Each of these uses token bucket or sliding window approaches under the hood.
Designing a Scalable Rate Limiter
Let’s think like system designers. What should our architecture look like?
Step 1: Identify the Scope
- Per user?
- Per IP?
- Per API key?
- Global limit?
Often, you’ll mix and match (e.g., per-IP + per-user).
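As a small sketch of mixing scopes, a request has to pass every applicable limit; the key format and limits are illustrative, and the fixed-window check stands in for whichever algorithm you pick:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
counters = defaultdict(int)

def check_limit(key: str, limit: int) -> bool:
    """Fixed-window check for one scope; any algorithm from above would slot in here."""
    window = int(time.time()) // WINDOW_SECONDS
    counters[(key, window)] += 1
    return counters[(key, window)] <= limit

def allow(ip: str, user_id: str) -> bool:
    # Per-IP and per-user limits combined: both must pass.
    return check_limit(f"ip:{ip}", limit=1000) and check_limit(f"user:{user_id}", limit=100)
```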
Step 2: Decide the Enforcement Point
- Client-side: Nice, but clients can cheat.
- Edge proxy (CDN, API Gateway): Best place to drop traffic early.
- Backend service: Last line of defense.
Pro tip: Enforce at the edge whenever possible. Saves backend resources.
Step 3: Choose Algorithm
- Want burst tolerance? → Token bucket.
- Want smooth traffic? → Leaky bucket.
- Want strict limits? → Sliding window log.
Step 4: State Management
- Single instance: Just a counter in memory.
- Distributed system: Use Redis with atomic operations (`INCR`, `EXPIRE`); a minimal sketch follows below.
- Ultra-high scale: Use sharded Redis clusters, gossip protocols, or approximate counters.
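A minimal sketch of that Redis pattern with the redis-py client; the key prefix, limit, and window length are illustrative:

```python
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def allow_request(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window counter in Redis: atomic INCR plus a TTL that starts the window."""
    key = f"rl:{client_id}"
    count = r.incr(key)                # atomic; creates the key at 1 if it was missing
    if count == 1:
        r.expire(key, window_seconds)  # first request of the window sets the TTL
    return count <= limit
```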
Step 5: Handling “What Now?”
Okay, you’ve hit the limit. What do you do?
- Return `429 Too Many Requests` (see the example after this list).
- Queue the request (best-effort).
- Offer exponential backoff hints in headers.
- Give premium users higher limits (hello, monetization).
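As a sketch of what the rejection path can look like, here is a minimal Flask handler; Flask is just for illustration, the limiter call is a stand-in for any of the implementations above, and the header values are made up:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def allow_request(client_id: str) -> bool:
    """Stand-in for any of the limiter sketches above; always rejects here for illustration."""
    return False

@app.route("/api/resource")
def resource():
    if not allow_request(client_id="demo-user"):
        response = jsonify(error="rate limit exceeded")
        response.status_code = 429
        # Tell the client when it is reasonable to retry, plus the usual limit headers.
        response.headers["Retry-After"] = "30"            # seconds (illustrative)
        response.headers["X-RateLimit-Limit"] = "100"
        response.headers["X-RateLimit-Remaining"] = "0"
        return response
    return jsonify(status="ok")
```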
Advanced Topics
1. Rate Limiting vs. Throttling
- Rate limiting: Hard cap on requests.
- Throttling: Slowing down requests but not outright blocking.
Think of it as: rate limiting says “no more rides for you”, throttling says “you’ll have to wait in line.”
2. Fairness & Multi-Tenant Systems
In multi-tenant systems, you don’t want one tenant starving others. Rate limiting ensures fair distribution of resources.
3. Adaptive Rate Limiting
Static rules aren’t enough in dynamic environments. Adaptive limiters adjust based on system health:
- If CPU is high → tighten limits.
- If load is low → relax limits.
This is often coupled with circuit breakers and load shedding.
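A sketch of the idea, assuming psutil is available as the CPU health signal; the thresholds and scaling factors are illustrative:

```python
import psutil  # assumed dependency, used only as a health signal

BASE_LIMIT = 100  # normal per-window limit (illustrative)

def current_limit() -> int:
    """Tighten the limit as CPU pressure grows; relax it when the host is healthy."""
    cpu = psutil.cpu_percent(interval=None)  # CPU usage since the previous call
    if cpu > 90:
        return BASE_LIMIT // 4   # heavy pressure: shed most traffic
    if cpu > 70:
        return BASE_LIMIT // 2   # elevated pressure: tighten
    return BASE_LIMIT            # healthy: normal limit
```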
4. Observability
A rate limiter without metrics is like a car with no dashboard. Track:
- Rejected requests.
- Latency impact.
- Distribution per client.
Use Prometheus + Grafana to visualize.
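A sketch of instrumenting decisions with the prometheus_client library; the metric name, labels, and port are illustrative:

```python
from prometheus_client import Counter, start_http_server

# Label by client and decision so dashboards can show who is being limited and how often.
RATE_LIMIT_DECISIONS = Counter(
    "rate_limiter_decisions_total",
    "Rate limiter allow/reject decisions",
    ["client_id", "decision"],
)

def record(client_id: str, allowed: bool) -> None:
    decision = "allowed" if allowed else "rejected"
    RATE_LIMIT_DECISIONS.labels(client_id=client_id, decision=decision).inc()

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```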
When Rate Limiting Saved Us
At one of my past companies, we exposed an API to downstream applications. One day, a partner accidentally pushed a buggy script that hit us with 10,000 requests per second.
Without rate limiting, our database would've melted. Instead, the rate limiter kicked in, blocking 95% of the calls, and the system kept running. The partner eventually fixed their script after being notified about the rejected requests.
Wrapping Up
Rate limiting isn’t just a tech buzzword — it’s a survival mechanism. As system designers, we need to think about it not only in terms of algorithms but also user experience. Done right, it keeps systems healthy, fair, and scalable.
Next time you design an API or system, ask yourself:
- How do I prevent abuse?
- How do I ensure fairness?
- How do I keep my backend alive during a storm?
Chances are, the answer will involve a rate limiter quietly standing guard.