HarborLab Customer Story | NOFire AI

HarborLab manages disbursements, port costs, and vessel operations for 4,000 registered vessels, 5,000 port agents, and 35,000 port calls across global shipping. It runs 20+ production microservices on AWS in eu-west-1, backed by RDS Aurora, Kubernetes with 46+ deployments, and Auto Scaling Groups peaking at 24 instances, with Grafana Cloud for observability. When a connection pool issue hits during peak hours, port agents closing operations across multiple time zones are left waiting on delayed calculations.

The challenge

Multiple services shared one RDS instance. After every release, connections maxed out. The team deployed, watched RDS strain, scaled the database manually, waited for things to settle. Port cost calculations slowed. Agents across time zones waited. Nobody could tell which deployment did it.

The answer was always spread across Loki, CloudWatch, and GitHub. Engineers searched all three trying to figure out which service was draining connections. Never one place.

Any of the 20+ services could have caused it. Any recent change in any of them. Narrowing it down meant checking each one.

There was no way to connect a commit to a deployment to an RDS metric. The team had to reconstruct that chain manually every time. Nobody ever finished.

So they scaled RDS, resolved the symptom, and moved on. Next release, same thing.
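To illustrate why a single release can exhaust a shared database, here is a minimal sketch of the worst-case connection arithmetic: every replica of every service can hold a full pool, so the peak is the sum of replicas times pool size. The service names, replica counts, pool sizes, and connection limit below are hypothetical placeholders, not HarborLab's actual configuration.

```python
def peak_connections(services):
    """Worst case: every replica of every service fills its connection pool."""
    return sum(replicas * pool_size for replicas, pool_size in services.values())


# Hypothetical numbers for illustration only, not HarborLab's configuration.
MAX_CONNECTIONS = 1000  # assumed connection limit on the shared database

# service name -> (replica count, pool size per replica)
before = {"svc-a": (6, 20), "svc-b": (10, 20), "svc-c": (20, 10)}

# One release raises one service's pool size from 10 to 50.
after = {**before, "svc-c": (20, 50)}

for label, services in (("before", before), ("after", after)):
    total = peak_connections(services)
    status = "over limit" if total > MAX_CONNECTIONS else "within limit"
    print(f"{label}: {total} connections ({status})")
```

Under these assumed numbers, one pool-size change in one service is enough to push the shared instance past its limit, even though no other service changed.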

How they use NOFire

HarborLab connected AWS, Kubernetes, and Grafana Cloud to NOFire. First incident after onboarding: 3 minutes to find the change, 15 minutes to resolve. They reverted the commit and the incident stopped recurring.

What showed up in 3 minutes:

Which service was draining connections.

The commit that changed the connection pool configuration.

The deployment that rolled it out.

RDS connection metrics aligned with the deployment timeline.
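The chain in the list above (service, commit, deployment, metric) can be sketched as a simple timeline join. This is an illustrative sketch only, not NOFire's implementation. The `Deployment` type, the `suspects` function, the 30-minute window, the threshold, and all sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    service: str           # hypothetical service name
    commit: str            # hypothetical commit SHA
    deployed_at: datetime


def first_breach(samples, threshold):
    """Return the timestamp of the first sample at or above threshold, else None."""
    for ts, value in samples:
        if value >= threshold:
            return ts
    return None


def suspects(deployments, samples, threshold, window=timedelta(minutes=30)):
    """Deployments that landed within `window` before the first breach, newest first."""
    breach = first_breach(samples, threshold)
    if breach is None:
        return []
    hits = [d for d in deployments if breach - window <= d.deployed_at <= breach]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical sample data: two deployments and a connection-count series.
t0 = datetime(2024, 1, 1, 12, 0)
deployments = [
    Deployment("billing-api", "a1b2c3d", t0 - timedelta(hours=3)),
    Deployment("port-cost-calc", "d4e5f6a", t0 + timedelta(minutes=5)),
]
samples = [
    (t0 + timedelta(minutes=m), v)
    for m, v in [(0, 120), (5, 130), (10, 480), (15, 900)]
]

result = suspects(deployments, samples, threshold=450)
for d in result:
    print(d.service, d.commit, d.deployed_at.isoformat())
```

A fixed time window is the simplest possible heuristic. It only narrows the candidate list and does not prove that a given deployment caused the spike.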

Ongoing:

10,000+ changes tracked across EC2, RDS Aurora, ASGs, and Kubernetes. The context is there before anything breaks.

3 RDS failover events correlated with upstream changes. No manual timeline work.

35+ investigations across storage, CPU, latency, and pod alerts. Each one traced to the relevant change.

3 min: To trace the change (from days)

35+: Investigations completed

100+ hrs: Saved vs. manual cross-tool work

The impact

Days to 3 minutes: The team reverted the commit and deployed a fix the same day. What used to take days of searching across three tools was done before the on-call engineer finished reading the alert.

The change fixed, not the symptom: Previous cycles ended with scaling RDS and closing the ticket. NOFire showed the connection pool config change, the deployment, and the metric spike in one chain. The team fixed the actual problem.

10,000+ changes indexed before alerts fire: Deployments, scaling events, config changes. When something breaks, the relevant change is already there. On busy days, over 1,000 changes are processed in 24 hours.

100+ hours back across 35+ investigations: Each one replaced the usual exercise of checking Loki, then CloudWatch, then GitHub. Answers show up in the same Slack channel where the alert fired.

Every release broke RDS. The team scaled the database, closed the ticket, and waited for it to happen again.

Run the runtime model on your stack.
