Product Ops

How I Reduced Network Fault Investigation Time by 90% at Airtel Business

A deep dive into how I approached a painful 3–5 hour manual workflow, turned it into a product problem, and shipped an automation that now runs in production — cutting investigation to 29 minutes.

The Problem Nobody Had Written Down

When I joined Airtel Business as an Automation & Analytics Engineer, I inherited a process that nobody had formalized: when a network fault hit, analysts had to manually figure out which B2B enterprise customers were impacted. That meant opening multiple dashboards, cross-referencing network topology data with service lists, calling teams for context, and waiting for handoffs.

The process took between 3 and 5 hours on a good day. Nobody had a single source of truth. Field Ops, NOC, Product, and Automation teams each had their own view — and their own definition of what 'impacted' meant.

The product insight: this wasn't a data problem — we had the data. It was a workflow problem. Analysts were doing correlation work that a system could do automatically, more reliably, and in seconds.

Starting Like a PM: Discovery Before Building

Before touching any code, I spent two weeks mapping the existing workflow. I watched analysts work through incident investigations, asked NOC leads where decisions slowed down, and studied 3 months of historical incident tickets to quantify the opportunity.

Discovery surfaced one key finding: 80% of the investigation time went to two steps, cross-referencing the fault location against topology maps and then matching topology nodes to customer service configurations. Both were repeatable, so both could be automated.
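
Those two steps amount to a graph walk followed by a lookup. Here is a minimal sketch, assuming the topology is available as an adjacency map and the service configurations as a node-to-customers map. The node names, customer names, and the `impacted_customers` function are hypothetical illustrations, not the production implementation.

```python
from collections import deque


def impacted_customers(topology, services, fault_node):
    """Return customers served by the fault node or any node downstream of it."""
    # Step 1: walk the topology breadth-first from the fault location.
    seen = {fault_node}
    queue = deque([fault_node])
    while queue:
        node = queue.popleft()
        for child in topology.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)

    # Step 2: match every affected node to its customer service configurations.
    customers = set()
    for node in seen:
        customers.update(services.get(node, ()))
    return sorted(customers)


# Hypothetical sample data: parent node -> downstream nodes, node -> customers.
topology = {
    "core-1": ["agg-1", "agg-2"],
    "agg-1": ["edge-1", "edge-2"],
    "agg-2": ["edge-3"],
}
services = {
    "edge-1": ["Acme Bank"],
    "edge-2": ["Acme Bank", "Globex"],
    "edge-3": ["Initech"],
}

print(impacted_customers(topology, services, "agg-1"))
```

A fault on `agg-1` in this sample reaches `edge-1` and `edge-2`, so the sketch reports the two customers behind them and leaves the customer on `edge-3` out.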

Investigation time before: 3–5 hrs
Stakeholder teams aligned: 4
Time reduction: 90%+

Writing the Spec Before the Code

I wrote a product spec before any engineering began. It defined: what the system would and would not do, what analyst validation would look like, how multi-vendor router complexity would be handled, and what the REST API contract between components would be.
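
To make the idea of a contract between components concrete, here is one way such a payload could be pinned down in Python. This is a sketch under assumptions: the `ImpactReport` name and every field in it are hypothetical, chosen only to illustrate a typed payload that serializes to JSON and back without loss.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ImpactReport:
    """Hypothetical payload passed between the correlation step and its consumers."""
    fault_id: str
    fault_node: str
    vendor: str
    impacted_customers: list = field(default_factory=list)
    analyst_validated: bool = False  # stays False until a human confirms the result

    def to_json(self):
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload):
        # Unknown or missing keys raise TypeError, which surfaces contract drift early.
        return cls(**json.loads(payload))


report = ImpactReport("F-001", "agg-1", "cisco", ["Acme Bank", "Globex"])
restored = ImpactReport.from_json(report.to_json())
```

Keeping the contract in one typed definition means a producer and a consumer that disagree on field names fail loudly at the boundary rather than silently dropping data.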

One deliberate tradeoff in the spec: we would not try to fully automate the escalation decision. The MVP would surface correlated impact data clearly, allow analysts to validate quickly, and then let humans decide what to escalate. Trust first — broad automation later.

Shipping: Python, Selenium, and PuTTY

The automation used Python for orchestration, Selenium for data extraction from internal dashboards, subprocess integration with PuTTY for multi-vendor router queries, and REST API contracts for clean handoffs between components. Tableau dashboards surfaced threshold-based SLA signals for proactive monitoring.
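
As an illustration of the subprocess piece, the sketch below builds a command line for `plink`, the command-line connection tool that ships with PuTTY. Using `plink` rather than the PuTTY GUI is an assumption, as are key-based authentication, the `VENDOR_COMMANDS` table, and the function names. The show commands are illustrative examples, not the queries the production system runs.

```python
import subprocess

# Hypothetical per-vendor query commands (assumption, not from the original system).
VENDOR_COMMANDS = {
    "cisco": "show interfaces description",
    "juniper": "show interfaces descriptions",
}


def build_plink_command(host, user, vendor, plink_path="plink"):
    """Build the argument list for a non-interactive plink session."""
    if vendor not in VENDOR_COMMANDS:
        raise ValueError(f"unsupported vendor: {vendor}")
    # -ssh selects the protocol, -batch disables interactive prompts,
    # -l sets the login name. Authentication is assumed to be key-based.
    return [plink_path, "-ssh", "-batch", "-l", user, host, VENDOR_COMMANDS[vendor]]


def query_router(host, user, vendor, timeout=30):
    """Run the vendor-specific command on a router and return its raw output."""
    result = subprocess.run(
        build_plink_command(host, user, vendor),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Separating command construction from execution keeps the vendor-specific logic testable without a live router, and passing an argument list instead of a shell string avoids quoting problems.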

Impact After Launch

Investigation time dropped from 3–5 hours to 29 minutes. The system is in live production use. Two critical SLA breach patterns were identified proactively before they became customer-reported tickets — something the previous workflow would never have caught in time.
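
The size of the reduction depends on which end of the 3–5 hour baseline it is measured against. A quick check of the arithmetic (the helper name `reduction_pct` is just for illustration):

```python
def reduction_pct(before_minutes, after_minutes):
    """Percentage reduction from the old duration to the new one, to one decimal."""
    return round(100 * (before_minutes - after_minutes) / before_minutes, 1)


low = reduction_pct(3 * 60, 29)   # against the 3-hour baseline
high = reduction_pct(5 * 60, 29)  # against the 5-hour baseline
print(low, high)  # 83.9 90.3
```

So the reduction is about 84% against a 3-hour investigation and about 90% against a 5-hour one.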

The biggest lesson: internal tools win when they reduce decision time, not just manual effort. The automation was valuable because analysts could trust its output and act on it — not just because it ran faster.