Every release broke RDS. The team scaled the database, closed the ticket, and waited for it to happen again.
HarborLab manages disbursements, port costs, and vessel operations for 4,000 registered vessels, 5,000 port agents, and 35,000 port calls across global shipping. It runs 20+ production microservices on AWS in eu-west-1, backed by RDS Aurora, with 46+ Kubernetes deployments and Auto Scaling Groups peaking at 24 instances. Grafana Cloud handles observability. When a connection pool issue hits during peak hours, port agents closing operations across multiple time zones wait on delayed calculations.
The challenge
Multiple services shared one RDS instance. After every release, connections maxed out. The team deployed, watched RDS strain, scaled the database manually, waited for things to settle. Port cost calculations slowed. Agents across time zones waited. Nobody could tell which deployment did it.
The answer was always spread across Loki, CloudWatch, and GitHub. Engineers searched all three trying to figure out which service was draining connections. Never one place.
Any of the 20+ services could have caused it. Any recent change in any of them. Narrowing it down meant checking each one.
There was no way to connect a commit to a deployment to an RDS metric. The team had to reconstruct that chain manually every time. Nobody ever finished.
So they scaled RDS, resolved the symptom, and moved on. Next release, same thing.
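To make the failure mode concrete, here is a minimal sketch of the arithmetic behind a shared connection limit. It assumes each replica of a service can open up to its configured pool size. The service names, replica counts, pool sizes, and the limit of 1,000 are all hypothetical and are not HarborLab's actual figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePool:
    name: str
    replicas: int   # pods or instances running the service
    pool_size: int  # max database connections each replica may open


def peak_connections(pools):
    """Worst-case connection count if every replica fills its pool."""
    return sum(p.replicas * p.pool_size for p in pools)


def over_budget(pools, max_connections):
    """Connections beyond the shared limit (0 if within budget)."""
    return max(0, peak_connections(pools) - max_connections)


# Hypothetical values for illustration only.
MAX_CONNECTIONS = 1000

before = [
    ServicePool("svc-a", replicas=6, pool_size=20),
    ServicePool("svc-b", replicas=10, pool_size=20),
    ServicePool("svc-c", replicas=24, pool_size=20),
]

# One release raises a single service's pool size from 20 to 50.
after = before[:2] + [ServicePool("svc-c", replicas=24, pool_size=50)]

print(peak_connections(before), over_budget(before, MAX_CONNECTIONS))
print(peak_connections(after), over_budget(after, MAX_CONNECTIONS))
```

Under these assumed numbers, one pool setting in one service is enough to push the shared instance past its limit. Scaling the database raises the limit without touching that setting.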
How they use NOFire
HarborLab connected AWS, Kubernetes, and Grafana Cloud to NOFire. First incident after onboarding: 3 minutes to find the change, 15 minutes to resolve. They reverted the commit and the incident stopped recurring.
What showed up in 3 minutes:
Which service was draining connections.
The commit that changed the connection pool configuration.
The deployment that rolled it out.
RDS connection metrics aligned with the deployment timeline.
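The chain above (service, commit, deployment, metric) can be sketched as a simple time-window join. This illustrates the general idea and does not describe how NOFire implements it. The `Deployment` record, the threshold, the 30-minute window, and all sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    service: str
    commit: str
    deployed_at: datetime


def first_breach(samples, threshold):
    """Timestamp of the first sample at or above threshold, else None."""
    for ts, value in samples:
        if value >= threshold:
            return ts
    return None


def suspects(deployments, breach_at, window):
    """Deployments that landed within `window` before the breach, newest first."""
    hits = [
        d for d in deployments
        if timedelta(0) <= breach_at - d.deployed_at <= window
    ]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical connection-count samples, one every 5 minutes.
t0 = datetime(2024, 1, 1, 9, 0)
samples = [
    (t0 + timedelta(minutes=5 * i), v)
    for i, v in enumerate([400, 420, 410, 900, 990])
]

# Hypothetical deployment history.
deployments = [
    Deployment("svc-a", "aaa111", datetime(2024, 1, 1, 7, 0)),
    Deployment("svc-b", "bbb222", datetime(2024, 1, 1, 9, 10)),
    Deployment("svc-c", "ccc333", datetime(2024, 1, 1, 9, 20)),
]

breach_at = first_breach(samples, threshold=900)
for d in suspects(deployments, breach_at, window=timedelta(minutes=30)):
    print(d.service, d.commit, d.deployed_at.isoformat())
```

In this sample data only the `svc-b` deployment falls inside the window before the breach. The earlier deployment is too old and the later one landed after connections had already spiked.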
Ongoing:
10,000+ changes tracked across EC2, RDS Aurora, ASGs, and Kubernetes. The context is there before anything breaks.
3 RDS failover events correlated with upstream changes. No manual timeline work.
35+ investigations across storage, CPU, latency, and pod alerts. Each one traced to the relevant change.
The impact
Days to 3 minutes: The team reverted the commit and deployed a fix the same day. What used to take days of searching across three tools was done before the on-call engineer finished reading the alert.
The cause fixed, not the symptom: Previous cycles ended with scaling RDS and closing the ticket. NOFire showed the connection pool config change, the deployment, and the metric spike in one chain. The team fixed the actual problem.
10,000+ changes indexed before alerts fire: Deployments, scaling events, config changes. When something breaks, the relevant change is already there. On busy days, over 1,000 changes processed in 24 hours.
100+ hours back across 35+ investigations: Each one replaced the usual exercise of checking Loki, then CloudWatch, then GitHub. Answers show up in the same Slack channel where the alert fired.
“We found the exact change that kept breaking RDS and fixed it. Releases don't end with panic and manual scaling anymore.”
Spyros Lamprinidis · CTO, HarborLab