ServiceNow · OODP
AI-Native · Enterprise Observability
Transforming incident investigation
from reactive troubleshooting
to AI-assisted decision making
Detection wasn't the bottleneck. Understanding was. I redesigned ServiceNow's event correlation experience — embedding AI trust signals and progressive disclosure into the triage workflow to cut MTTR from 14 minutes to 3.
↓ from 14 min
3 min
Mean Time to Resolution
↑ faster
78%
Reduction in investigation time
↑ at launch
92%
Adoption rate
6 mo
Discovery through delivery
Final Design
Instance Performance Platform — the shipped experience
servicenow.com / now / cloud-observer / instance-performance / broadcom-csm-prod
Project Details
Role
Product Designer
End-to-end ownership across research, strategy, design & delivery
Team
Cross-functional
PM · 2 Engineering Leads · SRE, SAM & CS stakeholders
Platform
Observability
ServiceNow OODP · Instance Performance Platform
Design System
ServiceNow Horizon
Component library · Figma tokens · Accessibility standards
Timeline
6 Months
Discovery through delivery · 2024
🛠️
Design tools
Figma · Miro · Adobe XD
Prototype
🔬
Research methods
Interviews · Incident shadowing · Journey mapping
Research & Discovery
Designing persona-specific questions to uncover meaningful insights
Based on the key personas and their operational contexts, I created targeted interview questionnaires — not generic surveys, but role-specific probes designed to surface the mental models each persona used during incidents.
Tailoring questions to each role helped uncover persona-specific needs, cross-functional dependencies, and high-impact opportunities for AI-driven workflow improvements.
Focus area 01
Core responsibilities & day-to-day workflows
Focus area 02
Pain points & recurring challenges
Focus area 03
Tool usage & data interaction patterns
Focus area 04
AI-native opportunities & automation potential
Focus area 05
Expectations around trust, transparency & AI assistance
Persona interview breakdown
Site Reliability Engineer · SRE
Operational depth & MTTR ownership
Ensure reliability, uptime, and SLA compliance. Manage alerts, logs, traces. Conduct RCA and postmortems.
What are your key reliability goals — MTTR, SLAs?
What's the hardest part of detecting an incident?
What would make AI insights trustworthy to you?
Support Account Manager · SAM
Client health & cross-functional reporting
Manage customer health and satisfaction. Communicate incident updates to clients. Act as liaison between technical and business teams.
How do you track service health for clients?
Where do you face communication gaps with SREs?
What would AI-written summaries need to include?
Customer Support Engineer · CSE
First-line triage & escalation
Provide first-line technical support. Escalate unresolved incidents. Document and communicate resolutions. Maintain response SLAs.
What makes issue triage slow for you?
How do you correlate multiple data sources?
What guardrails do you need before trusting AI?
Key insights from user interviews
Four findings shaped the entire design strategy
1
Investigations were manual correlation exercises
Engineers rebuilt event timelines by hand every time — context-switching across dashboards, jobs, and infrastructure views with no native way to understand how events related to each other.
2
Time window, not trends, drove decisions
During active incidents, engineers narrowed to the immediate degradation window. Any design that surfaced historical context before current context was working against natural investigation behavior.
3
Personas needed fundamentally different narratives
Engineers needed raw telemetry depth. Ops and support stakeholders needed contextual summaries they could act on without interpreting raw data. A single-view solution would serve neither well.
4
AI trust required visible reasoning
Users weren't resistant to AI assistance — they were resistant to opaque conclusions. When AI showed its reasoning, trust followed. When it didn't, the AI was ignored entirely, regardless of accuracy.
Research · Persona Dossiers
Three personas. Three fundamentally different needs.
Each persona required a distinct cognitive depth — from raw telemetry for SREs to executive summaries for TDOs. Understanding their workflows, friction points, and AI expectations shaped every design decision.
SRE · Persona Dossier
SRE
Sr Manager, Site Reliability Engineering
GSC Cloud Operations · India · 8–12 years experience
Eyrie (Alert Engine)
StatsNow
Grafana
Prometheus
Primary goals
- Single pane of glass — service health with drill-downs (org → service → region → instance)
- Proactive operations — predict issues (disk full, CPU growth) and act before SLO impact
- Reduce MTTR and context switching; standardise dashboards
- Clear SLI/SLO/error-budget visibility for engineers and leadership
Friction points
- Heavy tool-switching: Eyrie → StatsNow → Grafana → Prometheus → KBs
- Manual service-health assessment; no unified scoring
- Fragmented, non-standard Grafana dashboards; limited end-to-end troubleshooting views
- Reactive troubleshooting despite proactive ambitions, due to fragmented tooling
AI-native opportunities
- Prefers advisory AI (suggested remediation) over fully autonomous fixes
- Predictive analytics: growth/usage forecasts with "time to failure/impact"
- Alert correlation: group by service/incident; reduce noise and missed links
- RCA acceleration: timeline reconstruction, dependency/impact mapping, KB linking
"We need everything in one place, a single pane of glass for service health."
"Give us predictions — if we don't act now, when will it fail?"
CS Performance · Persona Dossier
CSP
Staff / Principal Technical Support Engineer
TechSupport Backend · Global Team · Senior (8–14 years)
StatsNow
RCC
Grafana
NOW Support (TNG)
CLI / SSH
Ruckus
Primary goals
- Reduce MTTR — improve first-time resolution through better context and deterministic guidance
- Self-explanatory systems — make observability intelligent, minimising manual correlation
- Less noise — reduce alert volume and redundant tool-hopping
- Enable SAMs to make context-rich case submissions, reducing unnecessary escalations
Friction points
- Alert enrichments appear as raw JSON — "why it breached" is unclear to most users
- Verification hop between StatsNow/Grafana — choosing the right window is hard
- Dashboards don't highlight which metrics matter now; multiple symptoms create ambiguity
- Isolated case investigations — limited visibility into pattern recurrences across customers
AI-native opportunities
- Auto-suggest relevant timeframes and metrics using anomaly clustering
- Conversational RCA: "Why did this alert trigger?" → AI summarises root cause and linked incidents
- Pattern recognition on stack traces or queries → auto-suggest related PRBs or defects
- Role-based summaries: SAM = overview, Engineer = detail + links
"Show me the cause in one glance, the window I should inspect, and the next thing to do — then let me go deep if I need to."
TDO · Persona Dossier
TDO
Senior Manager, Site Reliability Engineering Management
1500 GCS Cloud Operations · Dublin, IRE (Dawson) · 8–15 years
StatsNow (Cloud Observer)
Eyrie (Alerts & Tasks)
SRE Handover App
Microsoft Teams
Splunk
Bridge Manager
Primary goals
- Minimise customer impact — shorten detect → resolve cycle
- Situational awareness — accurate, timely status for execs without constant pings
- Reduce context-switching across tools
- Prevent incidents via capacity hygiene and trend visibility
Friction points
- Heavy tool-hopping: Eyrie ↔ Teams ↔ Handover ↔ Cloud Observer ↔ Splunk ↔ Bridge email
- Eyrie alert latency (~6–7 min) and occasional missed clears
- Bridge comms live in email — execs ping TDOs for status
- No unified "one-page" workspace tailored by role
AI-native opportunities
- Cross-signal anomaly detection + change correlation; predict blast radius
- Live Bridge panel (timeline, join, comms) inside product
- Guardrailed auto-remediation for low-risk runbooks (restart/drain/scale)
- Incident narrative and post-incident summaries; suggested next steps
"There's a lot of clicking and bouncing around to get what we need."
"The goal would be to bring them all together... one page."
The Strategic Reframe
Research didn't just surface pain points — it clarified the real product opportunity.
The gap wasn't in data quality or alert accuracy. It was in the experience between alert and understanding. I reframed the entire design strategy around one shift: from monitoring-first to investigation-first.
I established four design principles to make that concrete — not as documentation, but as decision filters used in every review from that point forward.
The core shift
Monitoring-first
→
Investigation-first
Display data
→
Answer questions
Historical context first
→
Current window first
One view for all
→
Persona-adaptive narratives
Opaque AI conclusions
→
Explainable AI reasoning
Four principles · every decision
01 · Investigation before visualisation
Every element must accelerate a decision
Every element should help a user answer a question, not just display data. If a visualisation didn't accelerate an investigation decision, it didn't belong on the screen.
02 · Time as context
Sequences and causality must be legible at a glance
Sequences and relationships matter more than isolated data points. Event ordering and causality needed to be legible at a glance — not reconstructed mentally by the engineer.
03 · Progressive disclosure
Detail on demand, not by default
The same event correlation data needed to render at different cognitive depths depending on who was looking at it. Detail on demand — not forced on every user at every moment.
04 · Explainable AI
Every AI output must show its reasoning
What did I find? Why are these events related? What should you look at next? No opaque conclusions: every AI annotation linked back to the raw signals that generated it.
The platform was already generating AI correlations. Engineers were ignoring them — not because they were wrong, but because there was no way to verify them. The design problem wasn't AI capability. It was AI legibility.
— Research synthesis · Design strategy framing
The Key Design Decision
The hardest call: who initiates the investigation?
Before a screen was designed, the team faced a foundational question that shaped every interaction pattern that followed. Early stakeholder position: let the AI drive. I pushed back — with the research.
❌ Early stakeholder position
AI-initiated
Let the AI drive. Surface proactive correlations automatically. Push insights before engineers even begin investigating. The AI decides what matters first — engineers receive conclusions, not discoveries.
Not trusted
Felt conclusive
Engineers disengaged
vs
✓ Research-backed decision
User-initiated + AI-accelerated
User-triggered investigation workflows with AI-accelerated correlation layered on top. Proactive AI summaries are available as an optional entry point, never a mandatory one. Engineers remain in control of the investigation frame at all times.
Engineers in control
AI as accelerator
Trust earned
The resolution · Hybrid model
User-triggered investigation workflows with AI-accelerated correlation
The incident shadowing had shown that when engineers didn't initiate an investigation themselves, they didn't trust its frame. AI-surfaced correlations felt like conclusions handed to them rather than discoveries they could verify. There was a learned skepticism toward automated root-cause analysis in the SRE community that couldn't be engineered away — it had to be designed around.
This wasn't a compromise. It was the right answer — and it required aligning design, PM, engineering, and observability architects who all had different initial instincts about where the AI should sit.
What the hybrid model meant in practice
🎯
Engineer triggers the frame
Investigation starts when the user opens a correlation workflow from an alert. They choose the starting point — not the AI. The frame of investigation is always engineer-owned.
⚡
AI accelerates within that frame
Once initiated, AI-annotated correlation chains surface related events, ranked by relevance — giving engineers 10× investigative reach without any loss of control or verification ability.
📋
Proactive summaries are optional
AI-generated summaries are available as an optional entry point: genuinely useful for SAMs and CSEs, but never forced on SREs who prefer to start from raw signals.
From hypothesis to shipped experience
To accelerate alignment, the PM team used AI-assisted ideation to create exploratory operational scenarios — not designs, but hypotheses used to validate workflows against research findings.
My role was to challenge each concept against real investigation behaviors — testing progressive disclosure, AI explainability, and differences between SRE and ops workflows. Early concepts failed because they surfaced conclusions without evidence trails and organised information by signal type instead of investigation timeline. These failures clarified what the experience needed to become.
⚡ Key debate · Information hierarchy
Raw telemetry first vs. correlation narrative first
Engineering stakeholders preferred exposing raw telemetry first, with AI correlations as secondary context. Research and incident shadowing showed something different: engineers use raw signals to validate hypotheses — not create them.
Resolution: a narrative-first approach, in which AI correlation leads and raw telemetry is revealed on demand. I advocated for it on the strength of the research evidence, and it prevailed. It was validated in V3 and in the final shipped experience.
Iteration Timeline
V1
Early concepts
Hypothesis validation
V2
Timeline-based
Investigation flow
V3
Persona-adaptive · Narrative-first · Shipped
V1
Early concepts — hypothesis validation
AI-assisted ideation was used to explore workflows against the research. These concepts organised events by signal type: alerts grouped with alerts, logs with logs. They failed because that grouping mirrored the existing tool-switching problem, and because AI conclusions came with no evidence trails, so engineers couldn't verify them. The failures clarified what the experience needed to become.
Failed · signal-type org
Learnings captured
Not shipped
V1 · Alert correlation overview — signal-type grouping
V1 · Metrics drill-down — no causal chain visible
V2
Timeline-based investigation flow
Established a chronological event timeline with AI-generated relationship annotations — causal arrows showing why signals were grouped. Engineers could read the incident as a narrative rather than a list of disconnected signals. Reviewed with design leadership and observability architects. Validated the timeline-first mental model.
Timeline-first
AI annotations added
Iteration
V2 · Timeline investigation — AI anomaly signal panel
V2 · AI insight drawer — root cause chain with context
V3
Persona-adaptive views · Narrative-first · Shipped
Introduced persona-adaptive views: a deep telemetry narrative for SREs and action-oriented briefings for SAMs and CSEs. Elevated AI investigation summaries into the primary entry point. Adopted a narrative-first hierarchy in which AI correlation leads and raw telemetry is available on demand. Validated with product stakeholders and observability architects. MTTR fell from 14 to 3 minutes, with 92% adoption at launch.
Shipped
Research-validated
Persona-adaptive
Final · Full dashboard — AI-native event correlation
Final · Instance Performance overview — Alerts, Diagnostics
Final · AI insight panel — root cause chain & recommended actions
Final · Causal chain reconstruction — 8-step propagation timeline
The Final Experience
A purpose-built investigation experience layered onto the Instance Performance platform
The final design introduced a dedicated Event Correlation workflow — accessible from any alert, anomaly, or degradation signal. Engineers could move from an alert notification directly into a timeline of related events, with AI-annotated relationship chains showing why signals were grouped.
How the experience worked end-to-end
1
Alert → Correlation timeline in one click
Engineers move directly from an alert notification into a timeline of related events. No tab-switching. No manual reconstruction. The investigation starts the moment the alert is opened — context already assembled.
2
AI-annotated relationship chains
Every event group shows the AI's reasoning: which signals are correlated, the confidence level, and the raw evidence behind the conclusion. Engineers stay in control and can expand a chain to verify it or dismiss it with a single click.
3
Role-calibrated AI summaries
AI-generated summaries adapt to the user's role context — deep telemetry narrative for SREs, action-oriented briefings for SAMs and CSEs. Same underlying data, different cognitive depth per persona.
4
Raw telemetry on demand
Any individual signal can be drilled into for full logs, metrics, and traces. Full depth always available — never forced upfront. Detail on demand, not by default. Engineers decide when to go deeper.
5
Confidence levels & evidence trails
Every AI annotation links back to the raw signals that generated it. Correlation chains could be expanded or dismissed. Summaries showed confidence scores. Every claim was verifiable — trust built through transparency.
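The evidence-trail rule in the steps above can be sketched as a small data model. This is an illustrative assumption, not the shipped ServiceNow schema: the `Signal` and `CorrelationAnnotation` classes, their field names, and the `evidence_trail` helper are all hypothetical, and the sample values are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Signal:
    """One raw telemetry item (hypothetical shape)."""
    signal_id: str
    kind: str          # e.g. "alert", "log", "metric", "trace"
    timestamp: float   # seconds since epoch
    summary: str


@dataclass
class CorrelationAnnotation:
    """An AI claim that a group of signals is related (hypothetical shape)."""
    reason: str              # why the signals were grouped
    confidence: float        # 0.0 to 1.0, shown next to the claim
    evidence_ids: List[str]  # raw signals the claim was derived from
    dismissed: bool = False  # set when the engineer rejects the claim


def evidence_trail(annotation: CorrelationAnnotation,
                   signals: Dict[str, Signal]) -> List[Signal]:
    """Return the annotation's raw signals in time order.

    Raises ValueError when the claim has no evidence or cites a signal
    that does not exist, so an unverifiable claim is never displayed.
    """
    missing = [sid for sid in annotation.evidence_ids if sid not in signals]
    if not annotation.evidence_ids or missing:
        raise ValueError(f"unverifiable annotation, missing evidence: {missing}")
    trail = [signals[sid] for sid in annotation.evidence_ids]
    return sorted(trail, key=lambda s: s.timestamp)


# Illustrative data only; these values are not taken from a real incident.
signals = {
    "a1": Signal("a1", "alert", 120.0, "Response time breach"),
    "m1": Signal("m1", "metric", 60.0, "Connection pool saturation"),
}
claim = CorrelationAnnotation(
    reason="Pool saturation preceded the response time breach",
    confidence=0.82,
    evidence_ids=["a1", "m1"],
)
print([s.signal_id for s in evidence_trail(claim, signals)])
```

Rejecting an annotation that cannot produce its trail is one way to make "every claim was verifiable" a structural guarantee rather than a UI convention.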
Persona-specific views
SRE · Site Reliability Engineer
Deep telemetry narrative
Full event timeline with causal chain
AI correlation with expandable evidence
Raw logs, metrics & traces on demand
Before / during / after comparison view
SAM · Support Account Manager
Action-oriented summary
Plain-language incident summary
Customer impact scope and duration
Current status & escalation state
Recommended client communication
CSE · Customer Support Engineer
Guided triage flow
AI-suggested root cause candidates
Similar past incident references
Next steps & escalation triggers
Confidence indicators on AI outputs
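One way to picture the three views above is as role-keyed projections of a single incident record. The sketch below is an assumption for illustration only: the `Role` enum, the field names in `FIELDS_BY_ROLE`, and the `view_for` helper are hypothetical and chosen to mirror the lists above, not taken from the product.

```python
from enum import Enum
from typing import Any, Dict


class Role(Enum):
    SRE = "sre"
    SAM = "sam"
    CSE = "cse"


# Hypothetical field names, chosen to mirror the three view lists above.
FIELDS_BY_ROLE = {
    Role.SRE: ("event_timeline", "ai_correlation", "raw_telemetry", "phase_comparison"),
    Role.SAM: ("plain_summary", "customer_impact", "status", "client_message"),
    Role.CSE: ("root_cause_candidates", "similar_incidents", "next_steps", "confidence"),
}


def view_for(role: Role, incident: Dict[str, Any]) -> Dict[str, Any]:
    """Project one incident record down to the fields a role sees.

    The role selects the depth; there is no toggle or preference involved.
    Fields absent from the record are skipped rather than invented.
    """
    return {name: incident[name] for name in FIELDS_BY_ROLE[role] if name in incident}


# Illustrative record only; the values are placeholders.
incident = {
    "event_timeline": ["pool saturation", "latency breach"],
    "raw_telemetry": {"logs": [], "metrics": [], "traces": []},
    "plain_summary": "Response times degraded for one instance.",
    "status": "investigating",
    "next_steps": ["check connection pool settings"],
    "confidence": 0.82,
}
print(view_for(Role.SAM, incident))
```

Keeping the projection in data rather than in the UI layer means the same record can serve all three personas without each view re-deriving what to hide.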
Interaction design · behavioral messages reinforced
🎯
Behavioral message
You are in control
Correlation chains could be expanded or dismissed. Summaries showed confidence levels. Every AI annotation linked back to the raw signals that generated it.
🤖
Behavioral message
The AI is working for you
AI-surfaced correlations felt like discoveries engineers could verify — not conclusions handed to them. The investigation frame was always engineer-owned.
🔐
Behavioral message
Trust is earned incrementally
Confidence levels, expandable reasoning, and evidence trails built trust through transparency — not assertion. Every small interaction reinforced one message.
Impact
14→3
Minutes · Mean Time to Resolution
78%
Reduction in investigation time
92%
Adoption rate at launch
3
Personas served by one adaptive experience
Prototype Walkthrough
The Design in Action
Explore the AI-native event correlation workflow — from alert trigger to causal chain reconstruction, role-adaptive summaries, and evidence-backed AI reasoning.
mahenderuxdesigner.com / portfolio / ai-native-event-correlation
AI-Native Event Correlation — ServiceNow
Cloud Observer
Reflection
What this project taught me about designing for understanding
The deepest lesson wasn't about AI, or observability, or enterprise SaaS specifically. It was about the difference between a data problem and an understanding problem — and how rarely those two things are the same.
Six things I'm carrying forward
🔍
Data and understanding are not the same thing
The platform had telemetry. Engineers had skills. Signals were firing at scale. Incidents still took 14 minutes to resolve — because no one had designed for the cognitive work between alert and answer.
🧭
The gap lives where metrics don't reach
The gap between alert and answer lives in a place that monitoring metrics don't capture and feature requests rarely name. Finding it required research that went beyond workflows into mental models.
🤝
AI trust is a design problem, not an engineering one
Users weren't resistant to AI. They were resistant to opacity. The moment reasoning became visible — confidence levels, evidence trails, annotations — skepticism turned to adoption.
🏗️
The decisions before any screen is drawn matter most
The hardest part wasn't the AI layer. It was defending the investigation-first frame long enough for the team to see it validated. That's the work that determined everything downstream.
👥
Persona differences are architectural, not just stylistic
SREs and SAMs don't need different polish on the same information. They need fundamentally different narratives. Progressive disclosure had to work at the data-structure level, not just the UI level.
📐
Principles must be decision filters, not documentation
The four design principles only worked because they were used as active filters in every design review — not written once and filed away. That's what made them real constraints.
Design principles · validated outcomes
01
Investigation before visualisation
Every screen element tied to an investigation question. Removed 3 visualisation components in V2 review that displayed data without accelerating a decision.
02
Time as context
Chronological event timeline became the primary navigation model. Engineers could read incident causality without mental reconstruction — critical during high-stress P1 incidents.
03
Progressive disclosure
Same correlation data rendered at three depths across SRE, SAM, and CSE. Role context determined cognitive depth — not a toggle or preference setting.
04
Explainable AI
Every AI output linked to the raw signals that generated it. Confidence scores displayed alongside conclusions. Engineers went from ignoring AI to actioning it — 92% adoption in 60 days.
Final thought
"The hardest part of this project wasn't the AI layer. It was defending the investigation-first frame long enough for the team to see it validated."
That's the work I find most meaningful — the decisions made before any screen is drawn. A design strategy that prioritised human reasoning over data completeness. Research that went beyond workflows into mental models. And the discipline to hold a frame under pressure long enough to see it proved right.
Mahender Kommaganti · Sr. UX Designer
ServiceNow · OODP · 2024
6 months · End-to-end ownership