
“Don’t mind me, I just need every page you’ve ever published immediately.”
There is a very strange war brewing between AI companies and classic web companies, and it’s being brought into sharp relief in the form of a new product launched by Cloudflare.
Cloudflare is a huge web-infrastructure company (its network sits in front of roughly a fifth of all websites), and they’re very assertive about bots, denial-of-service attacks, and the like. They have to be, or they won’t be able to deliver the web pages. One of the most recent threats they’ve had to contend with is AI companies’ web spiders.
To explain: In order to build an LLM, you need to gobble up as much content as possible—ideally all the content in the world. One of the best sources of that content is the open web, which, after all, was designed so anyone could come by and look at stuff. The web is full of research papers, forum posts, Wikipedia pages, cat pictures, and more.
So if you’re an LLM company, you write a spider, which is a program that crawls the web and downloads pages. When you run it, it grabs up tons of content to feed into the LLM. This is called “spidering.” It sounds a little creepy, but spiders have always been part of the web—that’s how you get search engines.
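If you’re curious what that looks like in practice, here’s a minimal sketch of a crawler in Python. Everything in it is illustrative (the seed URL, the page cap, the one-second pause); a real crawler adds queues, deduplication at scale, and retries, and runs across many machines:

```python
# A toy spider: fetch a page, harvest its links, follow them.
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect every href from the anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10, delay=1.0):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link, non-HTML, timeout: move on
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
        time.sleep(delay)  # the polite ones pause; the greedy ones don't
    return seen

crawl("https://example.com/")
```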
But now you have lots of extremely well-funded AI companies, and they need to feed their models—with those web pages. There are rules that spiders are supposed to follow, but if you’re an AI company, you may decide that rules don’t apply to you. You run your program on a server somewhere and go on to the next task.
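Those rules mostly live in a plain-text file called robots.txt at the root of a site, which asks crawlers to stay away from certain paths. Note “asks”: it’s a convention, not an enforcement mechanism. A site trying to wave off AI crawlers might publish something like this (GPTBot and ClaudeBot are real AI-crawler user-agents; the details are illustrative):

```
# Nothing forces a crawler to obey this file. Compliance is voluntary.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Crawl-delay: 10
```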
And the program runs, and runs, and runs. No intervention necessary. For the website manager, this too-vigorous spidering is basically an attack: it can crash your server, run up your bandwidth bill, and leave all your other visitors out in the cold. Unlike Google’s spiders, which at least theoretically drive traffic back to you when your pages show up in search results, this spidering will probably produce no benefit on your end. AI models don’t remember links or sources. They just digest tokens and excrete tokens.
This is all happening right now. Wikipedia, for example, reports that its bandwidth usage has gone up 50% because AI bots keep gobbling it whole. Your only real recourse is to block the bad spiders. But…what if they don’t want to stay blocked? There are many, many ways to change identifying information and start spidering again. They go away, change their hairstyle (an IP address here, a user-agent string there), and come back the next day to take even more. The spider can hide or put on a disguise, but the website cannot. After all, everyone knows your domain name.
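The disguise really is that cheap. The User-Agent string a spider sends is entirely self-reported, so a blocked crawler can simply claim to be someone’s browser (a hypothetical example; IP rotation through proxies works on the same principle):

```python
# Same spider, new hairstyle: announce yourself as an ordinary browser.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
html = urllib.request.urlopen(req, timeout=10).read()
```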
This leaves one resort: camouflage. Cloudflare’s strategy for doing that on behalf of its customers is a honeypot of useless web pages filled with accurate but meaningless facts. They call it “AI Labyrinth.” (I hope they make T-shirts.) If Cloudflare’s systems determine you’re an AI spider, they point you into the labyrinth and start delivering pointless information in enormous quantities. Each generated page links to still more generated pages. It goes on forever.
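Cloudflare hasn’t published the implementation, but the shape of the idea is easy to sketch: derive each page deterministically from its URL, fill it with true-but-useless statements, and link it to several more pages just like it. A toy version, with none of Cloudflare’s actual machinery:

```python
# A toy labyrinth: every URL yields a stable page of filler facts
# plus links deeper into the maze, so a link-following crawler
# never runs out of pages to fetch.
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

FACTS = [
    "Honey stored in a sealed container does not spoil.",
    "Octopuses have three hearts.",
    "The Eiffel Tower grows several centimeters in summer heat.",
]

class Labyrinth(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hash the path so each page is deterministic but "unique."
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        fact = FACTS[int(seed[:8], 16) % len(FACTS)]
        links = "".join(
            f'<p><a href="/maze/{seed[i:i + 12]}">further reading</a></p>'
            for i in range(0, 36, 12)
        )
        body = f"<html><body><p>{fact}</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

HTTPServer(("", 8000), Labyrinth).serve_forever()
```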
To the spider’s various classifiers, these pages will look totally fine. The spider will determine that it has found a free buffet of expensive gourmet food. But it’s actually eating out of a garbage pail. And to be clear—this is not a wacky experiment. It’s a product for Cloudflare’s many, many customers. You can go in there and check off the “AI Labyrinth” feature, and they’ll look for the bad spiders and feed them non-nutritive cognitive rubbish.
Cloudflare isn’t feeding misinformation, but someone else probably is—for the fun of it, or as a malicious state actor, or to ruin a competitor’s model-making abilities. It’s a terrible, recursive, AI-powered race to the bottom.
I use AI tools and want them all to have good models. But I have to agree with Cloudflare: This is probably the most ethical defensive measure against abusive AI spider-goblins they could enact on behalf of the open web. Regulation doesn’t work (and it’s not exactly a pro-regulation environment right now). Giant platform companies are pro-slop: Instagram is currently a flood of AI-generated horrors, and Google spins up AI search results out of thin air. The best defense is to cloak yourself and let the bots think they’ve won.
I think this sort of thing draws a clear line between “The Web” and “AI.” The web is based on standards that, while imperfect, lower the cost of entry for everyone. The AI industry is focused, at least today, on grabbing as much territory as possible. And a lot of AI players see the web as a commodity to be exploited, not a commons to be protected. “Spidering websites to death in order to regurgitate their tokens” is as close to “clear-cutting the forest for firewood” as things can get in tech.
So what now? Well, perhaps surprisingly, I’d bet on the web. It’s an imperfect, ungovernable system, and God knows it’s not the web of 25 years ago, but it’s still open, and it builds immunity over time. I hope LLMs start coming with their ingredients labeled, and I hope “ethical AI” becomes more of an option: an LLM that honors robots.txt, downloads Wikipedia’s database dumps instead of spidering it, and showcases the deals it’s made with publishers, media companies, and universities.
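The robots.txt part, at least, is genuinely easy. Python ships a parser for it in the standard library; a well-behaved crawler checks before every fetch (the bot name here is hypothetical):

```python
# A sketch of honoring robots.txt with Python's built-in parser.
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin

AGENT = "ExampleEthicalBot/1.0"

def polite_fetch(url):
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()  # download and parse the site's robots.txt
    if not rp.can_fetch(AGENT, url):
        return None  # the site said no; an ethical crawler listens
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```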
While the average AI company employee seems to believe that the apex of culture is the Lex Fridman podcast, at a certain point it’ll be cheaper to do things correctly, with contracts, than to get sued all the time. Shareholders will want this, too. Large corporations, governments, and universities represent a huge amount of buying power, and they will come with procurement rules and regulations.
I’ll give you an example of how things could go: Anthropic just launched Claude for Education. I hope that large university systems demand information on the provenance, quality, and ethics of the spidering that Claude does. Higher education has a vested interest in keeping the commons healthy, providing transparency, and making sure that citations are real, not hallucinated. Universities represent a big market for products like this, and they should use their buying power, their ethical standards, and their cultural positions to make sure AI companies aren’t just trashing the web or pirating every book they can.
We still live in a society, at least for now. AI and the web should be able to get along. Hopefully we don’t end up in a low-trust world of infinite honeypots and nonsensical slop. We could! But we don’t have to.