MongoDB went down. Thousands of schools went offline. The team spent hours just figuring out why.
LearnWorlds is a white-label SaaS platform powering 12,000+ online schools. Over $1 billion in revenue generated, 32 million learners educated. The backend runs across 8 regional production clusters on GCP with a global MongoDB Atlas deployment. When a cluster degrades in EU-West or a shard fails in US-East, schools lose access to courses, payments break, and instructors can't see their dashboards. The engineering team tracks 26,000+ infrastructure entities across Kubernetes and MongoDB. During an incident, the data they need is spread across systems with no shared interface.
The challenge
When MongoDB degraded, the team knew schools were down. They didn't know why. Atlas showed cluster metrics. GCP showed infrastructure health. Kubernetes showed pod status. But nothing connected a deployment or a query pattern shift to the database degradation. Engineers would check capacity, networking, and GCP issues one by one. The actual cause was usually a missing index after query patterns changed. By the time they figured it out, schools had been down for hours.
A shard or region would degrade, schools would go offline, and the team had to manually search Atlas, GCP, and Kubernetes to figure out what changed. Eight clusters, no single view.
Missing indexes caused cascading read latency that looked like an infrastructure problem. The team chased capacity and networking issues first. The real cause was a query pattern change.
Thousands of learners could be affected before the right engineer noticed the right dashboard. No alerting connected MongoDB topology to what actually changed.
Incident collaboration was Atlas screenshots pasted into Slack. Every new engineer joining the thread started from scratch.
How they use NOFire
LearnWorlds connected GCP, Kubernetes, and MongoDB Atlas to NOFire. NOFire indexes changes across all 8 clusters: deployments, scaling events, MongoDB operations, query pattern shifts. When a cluster degrades, the team gets the affected shards, the slow queries, and the change timeline in one Slack thread.
During MongoDB incidents:
Degraded shards and processes surfaced immediately. Slow queries identified. The change timeline shows what led to the failure, all in Slack.
The team can see the chain: which query pattern changed, which index was missing, which collections ran full scans. No more guessing if it's capacity, networking, or something else.
Cross-cluster incidents handled from a single thread. The deployment and scaling history across all 8 regions is already indexed.
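For readers who want to see what "which collections ran full scans" looks like in practice: MongoDB's explain output labels a full collection scan with the stage name COLLSCAN and an index scan with IXSCAN. The sketch below is not how NOFire works internally (the source does not describe that). It is a minimal illustration that walks an explain document, assumed to have already been fetched and converted to a Python dict, and reports whether the winning plan contains a collection scan. The function names and the sample plans, including the index name, are hypothetical.

```python
from typing import Any, Iterator


def iter_stages(plan: Any) -> Iterator[str]:
    """Yield every 'stage' name found anywhere in a nested explain plan."""
    if isinstance(plan, dict):
        stage = plan.get("stage")
        if isinstance(stage, str):
            yield stage
        for value in plan.values():
            yield from iter_stages(value)
    elif isinstance(plan, list):
        for item in plan:
            yield from iter_stages(item)


def runs_full_scan(explain_output: dict) -> bool:
    """Return True if the winning plan includes a COLLSCAN stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "COLLSCAN" in iter_stages(winning)


# Hypothetical explain documents, trimmed to the fields this sketch reads.
unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
indexed = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "school_id_1"},
        }
    }
}

print(runs_full_scan(unindexed))
print(runs_full_scan(indexed))
```

The walker recurses through every nested dict and list rather than following only `inputStage`, so it also covers plans that nest stages under other keys, such as per-shard plans on a sharded cluster.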
Between incidents:
15 missing indexes found and applied across production clusters before they caused outages. 3 were on a single European cluster where the same cascading read latency had hit before.
Session traffic and capacity trends checked in Slack during planning calls. Regional growth gets caught before it turns into degraded query performance.
Slow query checks across all 8 clusters, from Slack, without opening Atlas.
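One common way to spot index candidates ahead of an outage is to compare how many documents a query examines with how many it returns. The sketch below assumes a list of entries shaped like MongoDB database profiler documents, using their `ns`, `docsExamined`, and `nreturned` fields, and ranks namespaces by that ratio. It illustrates the idea only and is not the method LearnWorlds or NOFire uses. The function name, the threshold, and the sample namespaces are hypothetical.

```python
from collections import defaultdict


def rank_scan_heavy_namespaces(profile_entries, min_ratio=100.0):
    """Rank namespaces by documents examined per document returned."""
    totals = defaultdict(lambda: {"examined": 0, "returned": 0})
    for entry in profile_entries:
        ns = entry.get("ns")
        if ns is None:
            continue
        totals[ns]["examined"] += entry.get("docsExamined", 0)
        totals[ns]["returned"] += entry.get("nreturned", 0)

    ranked = []
    for ns, counts in totals.items():
        # Guard against division by zero when nothing was returned.
        ratio = counts["examined"] / max(counts["returned"], 1)
        if ratio >= min_ratio:
            ranked.append((ns, ratio))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical profiler entries; the namespaces are made up for illustration.
sample = [
    {"ns": "school.enrollments", "docsExamined": 50000, "nreturned": 10},
    {"ns": "school.enrollments", "docsExamined": 30000, "nreturned": 10},
    {"ns": "school.courses", "docsExamined": 20, "nreturned": 20},
]

for namespace, ratio in rank_scan_heavy_namespaces(sample):
    print(namespace, ratio)
```

A high ratio means the server reads many documents to return a few, which is the usual signature of a query that would benefit from an index.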
The impact
Shards, queries, and the change that caused the problem, in minutes: When a region degrades, the team sees which shards are hit and what changed. What used to be hours of correlating Atlas metrics with GCP logs now shows up in a Slack thread.
15 MongoDB indexes applied before any schools went down: Engineers surfaced the query patterns that had caused previous cascading failures and applied the indexes ahead of time. Not incident responses. Preventive fixes.
130+ hours of cross-system investigation replaced: 70+ investigations across Atlas, GCP Monitoring, and Kubernetes dashboards. Each one used to be a multi-tool exercise. Now it happens in the Slack channel where the incident is discussed.
Regional hotspot caught before it became an outage: Query volume in one growing cluster tripled in a single month. The trend showed up across scaling events and MongoDB metrics before it degraded into another schools-down situation.