
701 alerts investigated. 74 were noise. The team didn't have to check any of them.

Home Services Technology · Customer story

Ergeon installs fences, artificial grass, decks, and concrete for homeowners across 30+ US states. 13,700+ reviews, same-day video quoting. Behind the consumer product: a Django API, Celery and Dramatiq workers, RabbitMQ, Redis, and a real-time contractor coordination layer. When the backend has a problem, customer notifications stop, contractor scheduling stalls, and project timelines slip.

The challenge

Every alert started with the same question: is this real? Thresholds set during initial deployment hadn't been updated in over a year. The system had grown. The alerts hadn't. Real incidents and false positives were indistinguishable from the surface.

Queue-length monitors firing on thresholds that hadn't been updated since the original deployment. The system outgrew the alert config.

Failure-rate alerts claiming 100% breakdown while the system processed normally. The metric was miscalculated. An engineer still had to verify that manually every time.

A silent pod termination causing a queue backlog looked exactly like a misconfigured alert. Without correlating CloudWatch events with queue metrics, there was no way to tell them apart.
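To make that last case concrete, here is a minimal sketch of the kind of correlation involved, assuming pod termination timestamps and queue-depth samples have already been exported as plain Python values. The function names, the threshold, and the 10-minute window are hypothetical choices for illustration, not Ergeon's or NOFire's actual logic.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def backlog_start(samples: List[Tuple[datetime, int]], threshold: int) -> Optional[datetime]:
    """Return the timestamp of the first sample whose queue depth exceeds the threshold."""
    for ts, depth in samples:
        if depth > threshold:
            return ts
    return None


def explained_by_termination(
    terminations: List[datetime],
    samples: List[Tuple[datetime, int]],
    threshold: int,
    window: timedelta = timedelta(minutes=10),  # assumed correlation window
) -> bool:
    """True if the backlog began within `window` after some pod termination."""
    start = backlog_start(samples, threshold)
    if start is None:
        return False  # the queue never backed up, so there is nothing to explain
    return any(timedelta(0) <= start - t <= window for t in terminations)


# Hypothetical data: a pod is terminated at 12:06 and the queue crosses 100 at 12:10.
t0 = datetime(2024, 1, 1, 12, 0)
samples = [
    (t0, 10),
    (t0 + timedelta(minutes=5), 50),
    (t0 + timedelta(minutes=10), 500),
]
with_event = explained_by_termination([t0 + timedelta(minutes=6)], samples, threshold=100)
without_event = explained_by_termination([], samples, threshold=100)
```

With the termination event present the backlog is attributed to it; with no event in the window, the same queue curve gives no reason to suspect a pod loss, which is what made the two cases hard to separate by eye.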

Every alert: pull metrics, check logs, review deployments, decide whether to escalate. 30-40 minutes each, real or not.

How they use NOFire

Ergeon connected AWS, Prometheus, CloudWatch, and their Kubernetes cluster to NOFire. It runs alongside their existing Slack workflow. When an alert fires, NOFire checks what changed recently, tests hypotheses against live metrics and logs, classifies the alert, and posts a triage summary. Engineers review the answer instead of doing the investigation.

On every alert:

Recent changes checked first: deployments, pod events, scaling activity. Then hypotheses tested against metrics and logs. Worker crashes, slowdowns, task spikes, config drift.

False positives classified with confidence scores and evidence. Most engineers no longer verify them by hand.

Real incidents arrive with a diagnosis and the affected services. Context is already there.
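The steps above can be sketched as a small pipeline. This is an illustrative toy, not NOFire's implementation: the `Alert` and `TriageSummary` types, the signal names, the 2x baseline rule, and the confidence heuristic are all assumptions, and config drift is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Alert:
    name: str
    recent_changes: List[str]   # deployments, pod events, scaling activity
    signals: Dict[str, float]   # snapshot of metrics at alert time (hypothetical keys)


@dataclass
class TriageSummary:
    classification: str         # "incident" or "false_positive"
    confidence: float           # toy heuristic, see triage()
    evidence: List[str]


# Each hypothesis is a name plus a check against the signal snapshot.
# A missing baseline defaults to infinity so the check cannot pass by accident.
HYPOTHESES: List[Tuple[str, Callable[[Dict[str, float]], bool]]] = [
    ("worker crash", lambda s: s.get("worker_restarts", 0) > 0),
    ("slowdown", lambda s: s.get("task_seconds_p95", 0)
        > 2 * s.get("task_seconds_baseline", float("inf"))),
    ("task spike", lambda s: s.get("enqueue_rate", 0)
        > 2 * s.get("enqueue_rate_baseline", float("inf"))),
]


def triage(alert: Alert) -> TriageSummary:
    # 1. Recent changes are recorded first.
    evidence = [f"recent change: {c}" for c in alert.recent_changes]
    # 2. Hypotheses are tested against the current signals.
    supported = [name for name, check in HYPOTHESES if check(alert.signals)]
    evidence += [f"supported hypothesis: {name}" for name in supported]
    # 3. Classify. Confidence is a made-up heuristic for illustration only.
    if supported:
        return TriageSummary("incident", len(supported) / len(HYPOTHESES), evidence)
    evidence.append("no failure hypothesis supported by current signals")
    confidence = 0.5 if alert.recent_changes else 1.0
    return TriageSummary("false_positive", confidence, evidence)


noise = triage(Alert("queue_length_high", [], {"worker_restarts": 0}))
real = triage(Alert("queue_length_high", ["pod terminated"], {"worker_restarts": 1}))
```

In this sketch `noise` comes back as a false positive with its evidence attached, while `real` arrives classified as an incident with the pod event and the supported hypothesis already listed.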

Over 8 months:

701 investigations. 96% completed with a structured triage outcome.

74 false positives closed without paging anyone.

Volume grew 6x. Not because more things broke. Engineers trusted the first-pass triage and started routing more alerts through.

701 investigations automated
74 false positives caught
400+ hrs saved vs. manual triage

The impact

96% completion rate: Of 701 investigations, 96% finished with a structured triage outcome. The 4% that didn't finish still provided partial context for manual review.

74 false positives, zero pages: More than 1 in 10 were noise. Each one closed without a page or a manual metrics pull. 74 times an engineer wasn't woken up for nothing.

Same false positive, 14 times, handled faster each time: One queue-length alert fired 14 times on a stale threshold. By the third occurrence, the investigation was done in minutes. Engineers stopped treating it as real.

400+ hours back: 701 investigations at 30-40 minutes each if done manually. That time went to feature work and infrastructure improvements instead of on-call triage.
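For readers who want to check the arithmetic behind that figure, a quick sketch follows. The 35-minute value is an assumed midpoint of the stated 30-40 minute range, not a number from Ergeon.

```python
INVESTIGATIONS = 701  # from the story


def manual_triage_hours(investigations: int, minutes_each: float) -> float:
    """Hours a team would spend if every investigation were done by hand."""
    return investigations * minutes_each / 60


low = manual_triage_hours(INVESTIGATIONS, 30)   # 350.5 hours
mid = manual_triage_hours(INVESTIGATIONS, 35)   # about 408.9 hours (assumed midpoint)
high = manual_triage_hours(INVESTIGATIONS, 40)  # about 467.3 hours

print(f"{low:.1f} to {high:.1f} hours, midpoint {mid:.1f}")
```

The range runs from about 350 to about 467 hours, so the 400+ figure corresponds to an average of roughly 35 minutes per investigation.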

"We used to jump between 5 dashboards trying to piece together what happened. Now we get the full picture in one view. Our on-call engineers fix things instead of escalating to everyone."
Odysseas Tsatalos · CTO, Ergeon
