The Problem Nobody Had Written Down
When I joined Airtel Business as an Automation & Analytics Engineer, I inherited a process that nobody had formalized: when a network fault hit, analysts had to manually figure out which B2B enterprise customers were impacted. That meant opening multiple dashboards, cross-referencing network topology data with service lists, calling teams for context, and waiting for handoffs.
The process took between 3 and 5 hours on a good day. Nobody had a single source of truth. Field Ops, NOC, Product, and Automation teams each had their own view — and their own definition of what 'impacted' meant.
Starting Like a PM: Discovery Before Building
Before touching any code, I spent two weeks mapping the existing workflow. I watched analysts work through incident investigations, asked NOC leads where decisions slowed down, and studied 3 months of historical incident tickets to quantify the opportunity.
Discovery produced one key finding: 80% of investigation time went into two steps, cross-referencing the fault location against topology maps and then matching the affected topology nodes to customer service configurations. Both were repeatable. Both could be automated.
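To make those two steps concrete, here is a minimal sketch of the correlation as a pair of lookups: fault location to topology nodes, then nodes to customer services. The site names, node names, customers, and the `impacted_customers` function are all hypothetical placeholders for illustration, not the production data model.

```python
from collections import defaultdict

# Hypothetical sample data. In practice these records would come from the
# topology maps and customer service configurations.
TOPOLOGY = {  # fault location -> topology nodes behind it
    "POP-DEL-01": ["agg-del-1", "agg-del-2"],
    "POP-MUM-03": ["agg-mum-7"],
}
SERVICES = [  # one record per customer service, keyed by the node it rides on
    {"customer": "Acme Corp", "service_id": "MPLS-1001", "node": "agg-del-1"},
    {"customer": "Acme Corp", "service_id": "ILL-2040", "node": "agg-del-2"},
    {"customer": "Globex", "service_id": "MPLS-1177", "node": "agg-del-2"},
    {"customer": "Initech", "service_id": "ILL-3300", "node": "agg-mum-7"},
]


def impacted_customers(fault_location, topology, services):
    """Return {customer: [service_ids]} for services behind a fault location."""
    # Step 1: fault location -> affected topology nodes.
    nodes = set(topology.get(fault_location, []))
    # Step 2: affected nodes -> customer services configured on them.
    impacted = defaultdict(list)
    for svc in services:
        if svc["node"] in nodes:
            impacted[svc["customer"]].append(svc["service_id"])
    return {customer: sorted(ids) for customer, ids in impacted.items()}


result = impacted_customers("POP-DEL-01", TOPOLOGY, SERVICES)
print(result)
```

An unknown fault location returns an empty dictionary rather than raising, which in this sketch leaves the "no match" case for an analyst to interpret.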
Writing the Spec Before the Code
I wrote a product spec before any engineering began. It defined: what the system would and would not do, what analyst validation would look like, how multi-vendor router complexity would be handled, and what the REST API contract between components would be.
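As an illustration of what a contract between components might pin down, the sketch below models an impact report as typed records that serialize to and from JSON. Every field name here (`incident_id`, `fault_location`, `validation_status`, and so on) is an assumption made for the example; the actual spec's schema is not reproduced in this post.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ImpactedService:
    customer: str
    service_id: str
    node: str


@dataclass
class ImpactReport:
    """Hypothetical payload handed from the correlation step to the analyst view."""
    incident_id: str
    fault_location: str
    impacted: List[ImpactedService] = field(default_factory=list)
    # Assumed convention: only an analyst changes this; the automation leaves it pending.
    validation_status: str = "pending"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ImpactReport":
        data = json.loads(payload)
        data["impacted"] = [ImpactedService(**s) for s in data.get("impacted", [])]
        # Missing or unexpected fields raise TypeError, so contract drift fails loudly.
        return cls(**data)


report = ImpactReport(
    incident_id="INC-1",
    fault_location="POP-DEL-01",
    impacted=[ImpactedService("Acme Corp", "MPLS-1001", "agg-del-1")],
)
round_tripped = ImpactReport.from_json(report.to_json())
```

Keeping the payload this explicit is one way to let each component be built and tested against the contract rather than against another team's internals.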
One deliberate tradeoff in the spec: we would not try to fully automate the escalation decision. The MVP would surface correlated impact data clearly, allow analysts to validate quickly, and then let humans decide what to escalate. Trust first — broad automation later.
Shipping: Python, Selenium, and PuTTY
The automation used Python for orchestration, Selenium for data extraction from internal dashboards, subprocess integration with PuTTY for multi-vendor router queries, and REST API contracts for clean handoffs between components. Tableau dashboards surfaced threshold-based SLA signals for proactive monitoring.
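A rough sketch of the router-query piece, assuming PuTTY's command-line client `plink` is on the PATH and key-based SSH login is in place. The vendor-to-command mapping and the function names are illustrative placeholders; the commands the production system actually runs are not listed here.

```python
import subprocess
from typing import List

# Illustrative mapping only: one read-only command per vendor CLI.
VENDOR_COMMANDS = {
    "cisco": "show interfaces description",
    "juniper": "show interfaces descriptions",
    "huawei": "display interface description",
}


def build_plink_command(host: str, user: str, key_file: str, vendor: str) -> List[str]:
    """Build the plink argument list for a single vendor-specific query."""
    try:
        remote_command = VENDOR_COMMANDS[vendor]
    except KeyError:
        raise ValueError(f"unsupported vendor: {vendor}") from None
    # -batch disables interactive prompts so an unattended run fails instead of hanging.
    return ["plink", "-ssh", "-batch", "-l", user, "-i", key_file, host, remote_command]


def query_router(host: str, user: str, key_file: str, vendor: str, timeout_s: int = 30) -> str:
    """Run the query and return raw CLI output; raises on timeout or non-zero exit."""
    cmd = build_plink_command(host, user, key_file, vendor)
    completed = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with hostnames and commands. With `-batch`, a router whose host key has not already been cached makes the call fail, so the orchestration layer would need to handle that case.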
Impact After Launch
Investigation time dropped from 3–5 hours to 29 minutes. The system is in live production use. Two critical SLA breach patterns were identified proactively before they became customer-reported tickets — something the previous workflow would never have caught in time.