How I Reduced Network Fault Investigation Time by 90% at Airtel Business

The Problem Nobody Had Written Down

When I joined Airtel Business as an Automation & Analytics Engineer, I inherited a process that nobody had formalized: when a network fault hit, analysts had to manually figure out which B2B enterprise customers were impacted. That meant opening multiple dashboards, cross-referencing network topology data with service lists, calling teams for context, and waiting for handoffs.

The process took between 3 and 5 hours on a good day. Nobody had a single source of truth. Field Ops, NOC, Product, and Automation teams each had their own view — and their own definition of what 'impacted' meant.

The product insight: this wasn't a data problem — we had the data. It was a workflow problem. Analysts were doing correlation work that a system could do automatically, more reliably, and in seconds.

Starting Like a PM: Discovery Before Building

Before touching any code, I spent two weeks mapping the existing workflow. I watched analysts work through incident investigations, asked NOC leads where decisions slowed down, and studied 3 months of historical incident tickets to quantify the opportunity.

Key finding from discovery: 80% of the investigation time went into two steps. The first was cross-referencing the fault location against topology maps. The second was matching the affected topology nodes to customer service configurations. Both steps were repeatable, so both could be automated.
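The two repeatable steps above amount to a pair of lookups: fault location to topology nodes, then nodes to customer services. Here is a minimal sketch of that correlation. The data shapes, names, and sample records (TOPOLOGY, SERVICES, impacted_customers) are hypothetical and only illustrate the idea; they are not the production schema.

```python
from collections import defaultdict

# Hypothetical sample data. In practice these records would come from
# internal topology and service-inventory systems.
TOPOLOGY = {
    "site-A": ["rtr-1", "rtr-2"],
    "site-B": ["rtr-3"],
}
SERVICES = [
    {"customer": "Acme Corp", "service_id": "SVC-100", "node": "rtr-1"},
    {"customer": "Acme Corp", "service_id": "SVC-101", "node": "rtr-3"},
    {"customer": "Globex", "service_id": "SVC-200", "node": "rtr-2"},
]


def impacted_customers(fault_location, topology, services):
    """Return {customer: [service_id, ...]} for services on faulted nodes."""
    # Step 1: fault location -> topology nodes at that location.
    nodes = set(topology.get(fault_location, []))
    # Step 2: topology nodes -> customer services configured on them.
    impacted = defaultdict(list)
    for svc in services:
        if svc["node"] in nodes:
            impacted[svc["customer"]].append(svc["service_id"])
    return dict(impacted)


if __name__ == "__main__":
    print(impacted_customers("site-A", TOPOLOGY, SERVICES))
```

An unknown fault location returns an empty result rather than raising, which keeps the "nothing impacted" case explicit for the analyst.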

Investigation time before: 3–5 hrs

Stakeholder teams aligned: 4

Time reduction: 90%+

Writing the Spec Before the Code

I wrote a product spec before any engineering began. It defined: what the system would and would not do, what analyst validation would look like, how multi-vendor router complexity would be handled, and what the REST API contract between components would be.
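To make the idea of an API contract between components concrete, here is one way such a contract could be written down in Python. The field names (fault_id, fault_location, services, analyst_validated) and the helper functions are assumptions for illustration; the actual spec is not reproduced in this post.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ImpactedService:
    customer: str
    service_id: str
    node: str


@dataclass
class ImpactReport:
    # Hypothetical contract fields, not the real spec.
    fault_id: str
    fault_location: str
    services: list = field(default_factory=list)
    # Stays False until an analyst has reviewed the correlated output.
    analyst_validated: bool = False


REQUIRED_FIELDS = {"fault_id", "fault_location", "services", "analyst_validated"}


def to_payload(report):
    """Serialize a report to the JSON body one component sends to another."""
    return json.dumps(asdict(report), sort_keys=True)


def from_payload(payload):
    """Parse a JSON body and reject it if a contract field is missing."""
    data = json.loads(payload)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    services = [ImpactedService(**s) for s in data["services"]]
    return ImpactReport(
        data["fault_id"],
        data["fault_location"],
        services,
        data["analyst_validated"],
    )
```

Carrying analyst_validated in the payload reflects the spec's intent that the system surfaces data and a human confirms it before anything is escalated.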

One deliberate tradeoff in the spec: we would not try to fully automate the escalation decision. The MVP would surface correlated impact data clearly, allow analysts to validate quickly, and then let humans decide what to escalate. Trust first — broad automation later.

Shipping: Python, Selenium, and PuTTY

The automation used Python for orchestration, Selenium for data extraction from internal dashboards, subprocess integration with PuTTY for multi-vendor router queries, and REST API contracts for clean handoffs between components. Tableau dashboards surfaced threshold-based SLA signals for proactive monitoring.
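As an illustration of the subprocess-plus-PuTTY piece, the sketch below builds a command for plink, PuTTY's command-line connection tool, and picks the remote command by vendor. Using plink specifically, the vendor-to-command table, and the function names are assumptions made for this example, not a description of the production code. It also assumes authentication is already set up (for example a key loaded in Pageant), because -batch disables interactive prompts.

```python
import subprocess

# Illustrative vendor-to-command mapping; the real command set is not
# described in this post.
VENDOR_COMMANDS = {
    "cisco": "show interfaces",
    "juniper": "show interfaces terse",
    "huawei": "display interface brief",
}


def build_plink_command(host, user, vendor):
    """Build the argument list for a non-interactive plink SSH session."""
    try:
        remote_cmd = VENDOR_COMMANDS[vendor]
    except KeyError:
        raise ValueError(f"unsupported vendor: {vendor}") from None
    # -ssh selects the protocol, -batch disables prompts, -l sets the user.
    return ["plink", "-ssh", "-batch", "-l", user, host, remote_cmd]


def query_router(host, user, vendor, timeout=30, runner=subprocess.run):
    """Run the vendor-specific query and return its text output.

    `runner` is injectable so the function can be exercised without a
    real router or a plink binary.
    """
    cmd = build_plink_command(host, user, vendor)
    result = runner(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"query failed for {host}: {result.stderr.strip()}")
    return result.stdout
```

Passing the command as a list rather than a shell string avoids quoting problems, and the timeout stops one unresponsive router from stalling the whole investigation run.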

Impact After Launch

Investigation time dropped from 3–5 hours to 29 minutes. The system is in live production use. Two critical SLA breach patterns were identified proactively before they became customer-reported tickets — something the previous workflow would never have caught in time.

The biggest lesson: internal tools win when they reduce decision time, not just manual effort. The automation was valuable because analysts could trust its output and act on it — not just because it ran faster.
