Block 404 abuse in HAProxy

Introduction

Everyone on the internet has encountered a 404 page at least once. Usually, it doesn’t really matter - it’s just a page you wanted to visit that no longer exists. Sometimes it’s frustrating, because the content you searched for, or worse: needed, has been taken offline.

What many people don’t think about is that serving a 404 is often resource-intensive. Why, you may ask? Well, a large part of the internet is powered by content management systems (CMSes). These systems make life easier by making content manageable for a broader audience. If posting content is the only thing you need to do, you don’t have to understand websites and CMSes thoroughly.

Think about it. How many times, when you encounter a 404 page, do you see things like suggested pages, personalised content, or a “perhaps you meant…” prompt? Exactly! This is how the webmaster keeps you engaged with the site and its content: “This doesn’t exist, why don’t you try one of the articles below?” How do they know I might want to read about “Apple pie” instead of the “pumpkin pie” I searched for? By collecting and analysing data.

And here’s where the hidden cost comes in: every time a 404 page is served, the request often still runs through the entire application stack. The server receives the request, PHP wakes up, the CMS performs several database queries, PHP assembles the 404 page, complete with recommendations and personalised elements, and only then is the final page served to the user. And the fun part? All of that happens only to let you, the user, know that the page doesn’t exist.

The problem

Now you’re probably asking: why does that matter? Simple: every time someone requests a page that doesn’t exist, it consumes valuable resources that could have been used for users who are actually doing things: ordering products, browsing the site, posting content, and so on. Those users might even face a 503 Service Unavailable just because someone else is rapidly requesting pages that do not exist.

The problem isn’t the one user who hits a 404 page every once in a while. It’s the automated traffic (bots, crawlers) that causes the real trouble. A legitimate user hits a 404 only once every few seconds at most, and only a handful of times in a row. “Hmm, /posts/apple-pie-recipe doesn’t work… maybe /posts/applepie-recipe?” Sure, that still consumes resources, but only to a limited degree.

The internet is full of crawlers. Just like regular visitors, they crawl a lot of pages, only much faster. Sometimes several requests per second. And every time they do that, your server has to perform something along these lines:

NGINX: Thanks for submitting your request.
NGINX: Hey PHP, the user wants to see /posts/apple-pie-recipe. Could you please serve this page?
PHP: Hey NGINX. Yes, of course. Let me check with the database… hold on…
PHP: Hey MySQL, could you check if this page exists?
MySQL: Nope. Page doesn’t exist. 0 records found.
PHP: Aw, too bad! Then I’ll serve a 404 page instead. Could you maybe give me some related posts? The user searched for “Apple Pie”, so anything about “Apple”, “Pie”, or “Recipes” would help.
MySQL: vomits… machine noises…
MySQL: Here you go. Everything about “Apple”, “Pie”, and “Recipes”.
PHP: Thanks MySQL, you’re the best!
PHP: Building a fantastic 404 page full of suggestions based on the user’s input…
PHP: Here you go, NGINX. All the HTML is ready to be served.
NGINX: Awesome! Sending it to the visitor…

Now imagine your server having to process this multiple times per second, for essentially nothing. It’s not helping the crawler, it’s not helping real users, and meanwhile MySQL’s CPU usage climbs, PHP workers get exhausted, and server load increases. The result? Real visitors encounter slow webpages, or worse: Service Unavailable errors.

Understanding 404 patterns in HAProxy

Cool, I understand. So what now? How can we detect this behaviour? The easy part is that it’s usually automated. If you check the logs after this kind of incident, you’ll see that the same client (IP address) requests random pages over and over, every time resulting in a 404 Not Found. You might see multiple requests within a second for full, made-up URIs like /posts/apple-pie-leet-recipe-roflcopter.

Example

[03/Dec/2025:07:26:25.566] web-in~ nginx/10.20.1.39 0/0/0/3/3 404 3362 - - ---- 8/8/0/0/0 0/0 "POST https://example.org/graphql HTTP/2.0"
2025-12-03T07:26:25.634585+01:00 lb1 haproxy[554938]: 139.59.132.8:9881 [03/Dec/2025:07:26:25.631] web-in~ nginx/10.20.1.39 0/0/0/2/2 404 1621 - - ---- 9/9/0/0/0 0/0 "POST https://example.org/api HTTP/2.0"
2025-12-03T07:26:25.665393+01:00 lb1 haproxy[554938]: 139.59.132.8:10105 [03/Dec/2025:07:26:25.662] web-in~ nginx/10.20.1.39 0/0/0/2/2 404 1621 - - ---- 9/9/0/0/0 0/0 "POST https://example.org/api/graphql HTTP/2.0"
2025-12-03T07:26:25.700014+01:00 lb1 haproxy[554938]: 139.59.132.8:10105 [03/Dec/2025:07:26:25.697] web-in~ nginx/10.20.1.39 0/0/0/2/2 404 3362 - - ---- 9/9/0/0/0 0/0 "POST https://example.org/graphql/api HTTP/2.0"
2025-12-03T07:26:25.735664+01:00 lb1 haproxy[554938]: 139.59.132.8:9881 [03/Dec/2025:07:26:25.733] web-in~ nginx/10.20.1.39 0/0/0/2/2 404 1621 - - ---- 9/9/0/0/0 0/0 "POST https://example.org/api/gql HTTP/2.0"

This is an actual example from the load balancer that all traffic to this blog passes through. As you can see, the client triggered five 404 errors within the same second: 07:26:25. A real user cannot do this by accident, so this is either malicious or automated. Either way: traffic we don’t want.

How HAProxy can help

This is where HAProxy can help. Instead of letting your backend waste resources generating expensive 404 pages over and over again, you can detect this behaviour before it even reaches your application stack. Using “Stick Tables”, HAProxy can track how often a client hits 404 responses. If a client suddenly fires off too many missing-page requests in a short timespan, HAProxy can:

Rate-limit or slow them down
Temporarily block them
Redirect them to a lightweight static 404 page

And the cool part? All while keeping your backend healthy and responsive for actual visitors.

Taking action with HAProxy

Now that we understand why this kind of traffic is annoying and why it burns valuable CPU cycles, we can focus on how to stop it. Since we’re using HAProxy, we can intercept traffic and decide what to do with it before it gets routed to a backend server. The cool part is that everything I’ll write about is built-in. You don’t need any other packages, plugins or anything else. HAProxy is a really powerful tool for this kind of stuff.

The core of what we need revolves around stick tables. If you’re not familiar with stick tables, you can compare them to a tiny in-memory database inside HAProxy. They keep track of things like:

How many requests a client makes
How many errors they trigger
The rate of specific responses (like 404s or other 4xx codes)
A lot more, check it out: HAProxy Stick Tables documentation

Explaining the setup

The countermeasure we’re going to implement consists of a couple of components that work closely together. If you’re not familiar with HAProxy, I recommend reading up on it before trying anything like this in production. If you’re just reading for fun, or looking for an experiment: read on. I will explain each component below.

Components

ACL: ‘Access Control List’. With an ACL you can set up specific rules. E.g. “Check if path starts with /posts”, “Load a list of forbidden URIs”
Stick Table: A place to save information about requests, clients, etcetera.
http-request: Used for anything related to HTTP requests: tracking requests, denying requests, setting headers, etcetera.

Defining a Stick Table

Whenever I set up a new countermeasure, I usually start by defining ACLs. Since this is quite an advanced setup, we’ll define the stick table first. This is needed because our ACLs will read the general purpose counters stored in the stick table (a bit more about this below).

We will define a stick table named st_404_tracking in a separate backend. This way we can have multiple stick tables in our setup. We will use the type ip, so we track clients by their source IP.

backend st_404_tracking
  stick-table type ip size 1m expire 20m store gpc0,gpc1,gpc0_rate(10s)

Breaking it down

We’ve defined a stick table called st_404_tracking. This can be named anything, as long as it makes sense to you. In this stick table we chose to track IPs (type ip), set the size to roughly 1 million entries (size 1m), set a validity of 20 minutes (expire 20m) and store a couple of counters.

The counters we want to store in this case are: gpc0 and gpc1. I could go in-depth about how counters in HAProxy work, but that’s out of scope for this write-up. GPC stands for General Purpose Counters. We’re using gpc0 to count 404 hits and gpc1 as a ‘flag’: 0 means no block, 1 means blocked.

The validity (expire) is important. In this case we choose to store everything for 20 minutes. Meaning: if you trigger a 404 error, this information will be kept for 20 minutes. This doesn’t mean you’re automatically blocked; we just want to keep the information for 20 minutes. Because the tracking rule further down touches the entry on every request, any new request from the same IP within that window resets the 20-minute timer, not only another 404.

The last part is a counter as well. I’ve saved this for last so it makes a bit more sense. It’s declared in the same place as the other counters, but this one is special. The gpc0_rate(10s) counter measures how fast gpc0 grows: it reports how many times gpc0 was incremented over the last 10 seconds. For example: since gpc0 counts 404s, gpc0_rate(10s) tells you how many 404s a client triggered in the last 10 seconds.
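One caveat worth a small sketch: as far as I know, a table declared with type ip only stores IPv4 addresses. If your frontend also accepts IPv6 clients, a variant keyed on type ipv6 should cover both families, since HAProxy can convert IPv4 source addresses to their IPv6 form before storing them. This is an assumption about your environment, not part of the configuration used in the rest of this post:

```haproxy
# Sketch: the same table, keyed on IPv6 so both address families can be tracked.
# Assumption: the frontend is reachable over IPv6; the rest of this post keeps "type ip".
backend st_404_tracking
  stick-table type ipv6 size 1m expire 20m store gpc0,gpc1,gpc0_rate(10s)
```

Entries are slightly larger with type ipv6, so keep an eye on memory if you also raise the size.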

Request tracking

After we’ve declared the stick table, we can move on to Request Tracking. With request tracking, we track the actual request and save everything we need for this use case. This line is straightforward.

http-request track-sc0 src table st_404_tracking

Breaking it down

We’ve defined a tracking rule. It binds the client’s source IP (src) to stick counter sc0 in the st_404_tracking stick table. The sc in sc0 stands for Stick Counter. Basically: every request that comes in will be registered in the st_404_tracking table. Because we’ve defined type ip in the stick table, the IP address will be used as the key.

Access Control Lists (ACLs)

Let’s move on to the Access Control Lists (ACLs). The ACLs in this case are used to check whether certain thresholds have been reached. We will use these conditions later for the actual blocking.

acl 404_rate_exceeded sc0_gpc0_rate(st_404_tracking) ge 5
acl 404_blocked_ip sc0_get_gpc1(st_404_tracking) gt 0

Breaking it down

We’ve defined two ACLs. The first checks whether the 404 rate tracked via stick counter 0 (gpc0_rate, measured over 10 seconds) has reached the threshold of 5 (ge = greater than or equal). The second acts as a boolean to see whether an IP address is blocked: 0 is fine, anything above 0 is banned.

As you can see, we reference the same General Purpose Counters as defined in the stick table. It’s important to keep track of that (pun intended).

Response counters

The http-response directive below applies an action whenever HAProxy sends a response back to the client. The sc-inc-gpc0(0) action increments gpc0 for the current source IP in the first stick counter whenever a 404 is returned. In short: every time an IP gets a 404 response, gpc0 increases by one, and with it the rate of “404s in the last 10 seconds”.

http-response sc-inc-gpc0(0) if { status 404 }

If you look closely, the rate of gpc0 is exactly what the ge 5 ACL above checks. Every time a 404 is triggered, the counter increments by one, and the ACL is evaluated per client.

The http-response below checks if the threshold has been exceeded.

http-response sc-inc-gpc1(0) if 404_rate_exceeded

sc-inc-gpc1(0) increments the second GPC (gpc1), which we use as the ‘blocked’ flag. The condition if 404_rate_exceeded makes the rule fire only when the ACL from earlier (ge 5) is true. In short: once an IP reaches the 404 rate limit, it gets flagged as blocked.
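The flag makes the block sticky: it stays set until the table entry expires. For comparison, here is a minimal sketch of a softer variant that skips gpc1 and only turns clients away while their 404 rate is above the threshold. It assumes the same table and tracking rule as above, and the 429 status is my own pick for illustration:

```haproxy
# Sketch: rate-limit instead of flagging. Needs the track-sc0 rule to run first.
# While the client is denied it receives no new 404s, so the rate decays
# and the client is let through again after roughly 10 seconds.
http-request deny deny_status 429 if { sc0_gpc0_rate(st_404_tracking) ge 5 }
```

Which one fits better depends on how forgiving you want to be: the flag punishes for the full expire window, the rate check only slows a client down.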

Blocking the request

Everything we’ve done so far is just tracking requests and setting ACLs based on the outcome. Now the fun part: blocking bad actors. Using http-request we can make sure bad actors either get dropped immediately, or receive a fun response. For my own amusement, I return the 418 status page (I’m a Teapot!). You could also silently drop the connection, or return a small static HTML page. Whatever suits your taste.

http-request return status 418 if 404_blocked_ip

In this case we block bad actors only if they match the 404_blocked_ip ACL, which is either true or false. Whether it is true depends on whether the threshold has been reached :).
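If a teapot isn’t your thing, the alternatives mentioned above could look roughly like this. Treat it as a sketch: each rule reuses the 404_blocked_ip ACL, you would pick only one of them, and the file path in option 2 is hypothetical.

```haproxy
# Pick ONE of these instead of the 418 rule.

# Option 1: drop the connection without sending any response.
http-request silent-drop if 404_blocked_ip

# Option 2: serve a small static 404 page straight from HAProxy.
# The path is a placeholder; point it at your own HTML file.
http-request return status 404 content-type "text/html" file /etc/haproxy/static/404.html if 404_blocked_ip

# Option 3: answer with a plain 429 Too Many Requests.
http-request deny deny_status 429 if 404_blocked_ip
```

In all three cases the request is answered by HAProxy itself, so PHP and MySQL never see it.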

Improvements

In practice, you’ll often be dealing with clients who want to scan their own website using SEO tools like Screaming Frog. We don’t want them getting blocked after 5 (or so) 404s, since that would be slightly annoying. The simplest way to prevent this is by maintaining a whitelist of IP-addresses that will never be blocked.

IP whitelisting

acl ip_whitelist src -f /etc/haproxy/acl/whitelist.acl

This ACL reads a whitelist.acl file, which is simply a list of IP-addresses. All the IP-addresses in this file are loaded into the ACL. The only thing you’ll need to modify is the http-request return we applied earlier.

http-request return status 418 if 404_blocked_ip !ip_whitelist

Return a 418 if an IP has been flagged for blocking, but only if it’s not whitelisted.
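For reference, a whitelist file could look like the sketch below. The addresses are placeholders taken from the ranges reserved for documentation, so replace them with your own:

```haproxy
# /etc/haproxy/acl/whitelist.acl
# One address or CIDR range per line.
# Blank lines and lines starting with # are ignored.
192.0.2.10
198.51.100.0/24
2001:db8::/32
```

As far as I know, HAProxy reads this file when it starts, so reload HAProxy after editing it.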

Excluding user agents and static content

Legitimate bots like Googlebot or Bingbot continuously crawl content to check whether it’s still available. The countermeasure we’ve implemented doesn’t really care about who the visitor is or what content they tried to access. All it cares about is the response code.

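Therefore it’s wise to also exclude known good user-agents and common static file extensions. This takes a few additional ACLs, which we add to the rule that counts 404s (not to the blocking rule). One catch: the counting rule runs at response time, and as far as I know HAProxy cannot evaluate request-only fetches such as path or hdr there (it warns that such an ACL will never match). So we first copy the path and the User-Agent into transaction variables and match on those. The variable names txn.req_path and txn.req_ua are arbitrary.

http-request set-var(txn.req_path) path
http-request set-var(txn.req_ua) req.hdr(user-agent)
acl static_content var(txn.req_path) -m end -i jpg png svg css woff tiff jpeg
acl whitelist_useragents var(txn.req_ua) -m sub -i google bing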

You can expand these ACLs as needed. If you have a large number of user-agents to whitelist, you can load them from a file instead. Refer to IP Whitelisting above for an example of how that works.

Final configuration

Now that all the individual pieces are in place, let’s put it all together. Below is the complete HAProxy configuration for this countermeasure. You can drop this into your existing setup and adjust the thresholds, whitelist paths, and response codes to match your environment.

# Stick table backend: stores 404 hit counters per IP
backend st_404_tracking
  stick-table type ip size 1m expire 20m store gpc0,gpc1,gpc0_rate(10s)

frontend web-in
  # ... your existing frontend config ...

  # Track all incoming requests by source IP
  http-request track-sc0 src table st_404_tracking

  # ACL: rate threshold exceeded (5+ 404s in 10s)
  acl 404_rate_exceeded sc0_gpc0_rate(st_404_tracking) ge 5

  # ACL: IP has been flagged as blocked
  acl 404_blocked_ip sc0_get_gpc1(st_404_tracking) gt 0

  # ACL: whitelisted IP-addresses (clients, known bots, etc.)
  acl ip_whitelist src -f /etc/haproxy/acl/whitelist.acl

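  # Response-time rules cannot read the request path or headers directly,
  # so copy them into transaction variables first
  http-request set-var(txn.req_path) path
  http-request set-var(txn.req_ua) req.hdr(user-agent)

  # ACL: static file extensions: don't count these as 404s
  acl static_content var(txn.req_path) -m end -i jpg png svg css woff tiff jpeg

  # ACL: known legitimate crawlers
  acl whitelist_useragents var(txn.req_ua) -m sub -i google bing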

  # Block flagged IPs (unless whitelisted): return 418 I'm a Teapot
  http-request return status 418 if 404_blocked_ip !ip_whitelist

  # Increment 404 counter per IP on 404 responses (static files and known crawlers excluded)
  http-response sc-inc-gpc0(0) if { status 404 } !static_content !whitelist_useragents

  # Flag IP as blocked once rate threshold is exceeded
  http-response sc-inc-gpc1(0) if 404_rate_exceeded

That’s it. HAProxy will now absorb the bulk of abusive 404 traffic before it ever reaches your application stack, keeping your backend healthy and your PHP workers well-rested.

Happy load balancing.