Cloudflare chief executive officer (CEO) Matthew Prince has confirmed that the global internet outage on 18 November 2025, an incident that disrupted major services including X, ChatGPT, Canva, Discord and thousands of websites worldwide, was not triggered by a cyberattack. Instead, an internal configuration error spiralled into one of the most serious failures in the company’s recent history. Mr Prince apologised to customers and the broader internet community, saying, “Given Cloudflare’s importance to the Internet ecosystem, any outage of any of our systems is unacceptable. We know we let you down today.”
In a detailed postmortem published hours after services were restored, Mr Prince explained that the problem began with a routine change to permissions on a ClickHouse database cluster. "The update was intended to improve access controls and enhance reliability, but instead caused the system to generate duplicate entries in a key 'feature file' used by Cloudflare’s bot management system. The file, which is refreshed every five minutes, suddenly doubled in size and that proved catastrophic."
Cloudflare’s routing software has a strict upper limit on how large this feature file can be before it fails to load. When the oversized file propagated across the network, machines at the edge began to crash, returning widespread HTTP 5xx errors to users around the world. “The software had a limit on the size of the feature file that was below its doubled size,” Mr Prince wrote. “That caused the software to fail.”
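In rough terms, that failure mode can be sketched in a few lines of Python. The names and numbers here are illustrative stand-ins, not Cloudflare's actual software:

```python
# Minimal sketch of the failure mode: an edge process that must load the
# feature file on refresh, and returns HTTP 5xx when the load fails.
# SIZE_LIMIT and all function names are hypothetical.

SIZE_LIMIT = 1_000_000  # hard cap on the feature file, in bytes

def refresh(feature_file: bytes) -> bool:
    """Return True if the new file loaded, False if it was rejected."""
    return len(feature_file) <= SIZE_LIMIT

def handle_request(loaded: bool) -> int:
    # With no valid feature file, the proxy cannot score traffic and
    # answers with a server error instead of degrading gracefully.
    return 200 if loaded else 500

normal_file = b"f" * 600_000
doubled_file = normal_file * 2   # duplicate entries roughly doubled the size

handle_request(refresh(normal_file))   # 200: edge serving normally
handle_request(refresh(doubled_file))  # 500: oversized file rejected, errors follow
```

The key design point the incident exposed is the last line: rejecting the oversized file was correct, but failing the request outright, rather than falling back to the last known-good file, is what turned a bad config push into a global outage.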
What confused users and Cloudflare engineers alike was the unusual, intermittent nature of the failure. For the first three hours, from around 11:20 Coordinated Universal Time (UTC), Cloudflare’s global network repeatedly swung between working normally and collapsing again. Mr Prince described this as “very unusual behaviour for an internal error.”
The reason, he explained, was that the faulty file was only generated by the parts of the database cluster that had already received the new permissions change. When the file came from an unaffected node, the system temporarily stabilised. When it came from an updated node, it failed again. Every five minutes, the network had a chance of receiving a 'good' file, which allowed a brief recovery, or a 'bad' one, which crashed systems again.
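That oscillation is easy to reproduce in a toy model. The following Python sketch is an assumption-laden simplification (the refresh logic and the half-updated cluster fraction are invented, not Cloudflare's code), drawing one feature file per five-minute cycle:

```python
import random

def next_feature_file(updated_fraction: float) -> str:
    # Every five minutes the file is regenerated by whichever ClickHouse
    # node happens to answer; nodes already carrying the permissions
    # change emit a bad (doubled) file, untouched nodes emit a good one.
    return "bad" if random.random() < updated_fraction else "good"

random.seed(42)
# Simulate twelve refresh cycles (one hour) with half the cluster updated:
cycles = [next_feature_file(updated_fraction=0.5) for _ in range(12)]
# The network flips between recovering and failing as good and bad files
# alternate -- the intermittent pattern that initially looked like an attack.
```

Because each cycle is an independent draw, recoveries and collapses arrive in no predictable order, which is why the pattern resembled a pulsed attack rather than a deterministic internal bug.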
“It made it unclear what was happening,” Mr Prince admitted. “Initially, this led us to believe this might be caused by an attack.”
The suspicion was intensified when Cloudflare’s status page, which is entirely hosted outside Cloudflare’s infrastructure, also became briefly inaccessible. Engineers feared that the company was facing a coordinated, high-volume DDoS attack, similar to recent Aisuru campaigns targeting major internet infrastructure.
“It was a coincidence, but it led some of the team to believe an attacker may be targeting both our systems and our status page,” Mr Prince said.
However, Jake Moore, global security adviser at ESET, points out that the outages witnessed over the last few months have once again highlighted the reliance on these fragile networks. "Companies are often forced to heavily rely on the likes of Cloudflare, Microsoft, and Amazon for hosting their websites and services as there aren't many other options. The problems causing these outages have occurred due to a Domain Name System (DNS) problem which is most likely overwhelmed. The technology is based on an outdated, legacy network that redirects words in web addresses into computer-friendly numbers. When this system fails, it catastrophically collapses and causes these outages."
"However, the problem is that this system cannot be replaced easily. It may sound risky, but the major cloud providers actually have lots of impressive fail-safes in place and usually provide more protection than the lesser well-known cloud providers," he added.
The cascading failures affected multiple Cloudflare services beyond its core content delivery network (CDN) and security services. Turnstile, Cloudflare’s CAPTCHA-alternative service, failed to load, leaving many users locked out of dashboards. Workers KV, a key storage layer for Cloudflare’s serverless platform, produced elevated 5xx errors. Access, the company’s identity layer, experienced widespread authentication failures.
While traffic flowed through some legacy proxy systems, bot scores were not generated correctly, causing sites that depend on bot-filtering rules to block legitimate traffic.
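The effect on those sites resembles a fail-closed firewall rule. As a hypothetical Python sketch (the threshold, the default score, and the function names are illustrative, not Cloudflare's actual rule engine):

```python
def evaluate_request(bot_score, block_threshold=30):
    # A typical customer rule: block requests whose bot score falls at or
    # below the threshold. When scoring broke, a missing score behaved
    # like a rock-bottom one, so a fail-closed rule blocked real users.
    if bot_score is None:
        bot_score = 0  # no score available: treated as "definitely a bot"
    return "block" if bot_score <= block_threshold else "allow"

evaluate_request(85)    # normal human traffic: allowed
evaluate_request(None)  # scoring unavailable: legitimate traffic blocked
```

The design trade-off is visible in the `None` branch: failing closed protects customers from bots during an attack, but during an internal scoring failure it punishes legitimate visitors instead.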
The outage also manifested as higher latencies and timeouts, as Cloudflare’s observability and debugging systems automatically kicked in, adding further load to an already struggling network.
Cloudflare engineers eventually traced the root cause to the misconfigured ClickHouse permissions and the runaway feature file. By 14:30 UTC, the team stopped propagation of the bad file, restored an earlier known-good version, and began systematically restarting affected systems. By 17:06 UTC, Mr Prince reported that all Cloudflare systems were “functioning as normal”.
“This was Cloudflare’s most serious outage since 2019,” he said, adding that the incident caused “deep pain” to the entire team.
At the heart of the failure was a subtle and unintended consequence of a database permissions update. When the updated ClickHouse cluster generated metadata using a common query, one that expected results only from Cloudflare’s 'default' database, it inadvertently included metadata from an additional database (named r0). This effectively duplicated the number of machine-learning features in the generated file.
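The mechanism is analogous to a metadata query that omits a database filter. A simplified Python stand-in for the behaviour (the table, column counts and names are invented for illustration):

```python
# Hypothetical stand-in for ClickHouse's system.columns metadata rows:
system_columns = [
    {"database": "default", "table": "http_requests", "name": f"feature_{i}"}
    for i in range(100)
] + [
    {"database": "r0", "table": "http_requests", "name": f"feature_{i}"}
    for i in range(100)  # same tables, now visible via the r0 database
]

def list_features(rows, database=None):
    # Before the permissions change, only 'default' was visible, so a
    # query with no database filter still returned 100 rows. Afterwards,
    # the same unfiltered query also matched r0, doubling the output.
    return [r["name"] for r in rows
            if database is None or r["database"] == database]

unfiltered = list_features(system_columns)           # 200 rows: duplicated
filtered = list_features(system_columns, "default")  # 100 rows: correct
```

The fix implied by the postmortem is the filtered form: a metadata query should name the database it expects, so that newly visible replicas cannot silently change its result set.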
Cloudflare’s bot management system expects no more than 200 features, with memory pre-allocated for performance reasons. The malformed file exceeded this limit, triggering a panic in the company’s Rust-based FL2 proxy engine. That panic cascaded across Cloudflare’s global edge, generating the widespread 5xx errors.
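A rough analogue of that preallocated table, sketched in Python (the real engine is Rust and panics; this hedged sketch simply raises, and the 200-slot figure is the one the postmortem cites):

```python
MAX_FEATURES = 200  # the bot management limit cited in the postmortem

def load_features(features):
    # Memory for the feature table is preallocated for a fixed maximum,
    # so an oversized input is a hard error rather than a slow path --
    # analogous to the Rust panic, but not Cloudflare's actual code.
    slots = [None] * MAX_FEATURES
    if len(features) > len(slots):
        raise OverflowError(
            f"{len(features)} features exceed the {MAX_FEATURES}-slot table")
    for i, feature in enumerate(features):
        slots[i] = feature
    return slots

normal = load_features([f"f{i}" for i in range(150)])  # fits in the table
try:
    load_features([f"f{i}" for i in range(300)])       # doubled set: fails
except OverflowError:
    pass
```

Preallocation buys predictable latency on the hot path, but it also means the limit is a hard wall: once the duplicated file pushed the feature count past it, every proxy loading the file hit the same error at once.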
The Cloudflare CEO emphasised that the company’s goal is to ensure uninterrupted traffic flow and that lessons learned from the outage will be used to enhance the resilience of Cloudflare’s global infrastructure.