AI-Native Event Correlation for Enterprise Observability

01 · Hero

ServiceNow · OODP AI-Native · Enterprise Observability

Transforming incident investigation from reactive troubleshooting
to AI-assisted decision making

Detection wasn't the bottleneck. Understanding was. I redesigned ServiceNow's event correlation experience — embedding AI trust signals and progressive disclosure into the triage workflow to cut MTTR from 14 minutes to 3.

↓ from 14 min

3 min

Mean Time to Resolution

↑ faster

78%

Reduction in investigation time

↑ at launch

92%

Adoption rate

6 mo

Discovery through delivery

Final Design

Instance Performance Platform — the shipped experience

01b · Project Details

Project Details

Role

Product Designer

End-to-end ownership across research, strategy, design & delivery

Team

Cross-functional

PM · 2 Engineering Leads · SRE, SAM & CS stakeholders

Platform

Observability

ServiceNow OODP · Instance Performance Platform

Design System

ServiceNow Horizon

Component library · Figma tokens · Accessibility standards

Timeline

6 Months

Discovery through delivery · 2024

🛠️

Design tools

Figma · Miro · Adobe XD

🔬

Research methods

Interviews · Incident shadowing · Journey mapping

03 · Research

Research & Discovery

Designing persona-specific questions to uncover meaningful insights

Based on the key personas and their operational contexts, I created targeted interview questionnaires — not generic surveys, but role-specific probes designed to surface the mental models each persona used during incidents.

Tailoring questions to each role helped uncover persona-specific needs, cross-functional dependencies, and high-impact opportunities for AI-driven workflow improvements.

Focus area 01

Core responsibilities & day-to-day workflows

Focus area 02

Pain points & recurring challenges

Focus area 03

Tool usage & data interaction patterns

Focus area 04

AI-native opportunities & automation potential

Focus area 05

Expectations around trust, transparency & AI assistance

Persona interview breakdown

Site Reliability Engineer · SRE

Operational depth & MTTR ownership

Ensure reliability, uptime, and SLA compliance. Manage alerts, logs, traces. Conduct RCA and postmortems.

What are your key reliability goals — MTTR, SLAs?

What's the hardest part of detecting an incident?

What would make AI insights trustworthy to you?

Support Account Manager · SAM

Client health & cross-functional reporting

Manage customer health and satisfaction. Communicate incident updates to clients. Act as liaison between technical and business teams.

How do you track service health for clients?

Where do you face communication gaps with SREs?

What would AI-written summaries need to include?

Customer Support Engineer · CSE

First-line triage & escalation

Provide first-line technical support. Escalate unresolved incidents. Document and communicate resolutions. Maintain response SLAs.

What makes issue triage slow for you?

How do you correlate multiple data sources?

What guardrails do you need before trusting AI?

Key insights from user interviews

Four findings shaped the entire design strategy

Investigations were manual correlation exercises

Engineers rebuilt event timelines by hand every time — context-switching across dashboards, jobs, and infrastructure views with no native way to understand how events related to each other.

Time window, not trends, drove decisions

During active incidents, engineers narrowed to the immediate degradation window. Any design that surfaced historical context before current context was working against natural investigation behavior.

Personas needed fundamentally different narratives

Engineers needed raw telemetry depth. Ops and support stakeholders needed contextual summaries they could act on without interpreting raw data. A single-view solution would serve neither well.

AI trust required visible reasoning

Users weren't resistant to AI assistance — they were resistant to opaque conclusions. When AI showed its reasoning, trust followed. When it didn't, the AI was ignored entirely, regardless of accuracy.

Research · Persona Dossiers

Three personas. Three fundamentally different needs.

Each persona required a distinct cognitive depth — from raw telemetry for SREs to executive summaries for TDOs. Understanding their workflows, friction points, and AI expectations shaped every design decision.

SRE · Persona Dossier

SRE · Sr Manager, Site Reliability Engineering · GSC Cloud Operations · India · 8–12 years experience

Eyrie (Alert Engine) · StatsNow · Grafana · Prometheus

Primary goals

Single pane of glass — service health with drill-downs (org → service → region → instance)
Proactive operations — predict issues (disk full, CPU growth) and act before SLO impact
Reduce MTTR and context switching; standardise dashboards
Clear SLI/SLO/error-budget visibility for engineers and leadership

Friction points

Heavy tool-switching: Eyrie → StatsNow → Grafana → Prometheus → KBs
Manual service-health assessment; no unified scoring
Fragmented, non-standard Grafana dashboards; limited end-to-end troubleshooting views
Reactive troubleshooting despite proactive ambitions, due to fragmented tooling

AI-native opportunities

Prefers advisory AI (suggested remediation) over fully autonomous fixes
Predictive analytics: growth/usage forecasts with "time to failure/impact"
Alert correlation: group by service/incident; reduce noise and missed links
RCA acceleration: timeline reconstruction, dependency/impact mapping, KB linking

"We need everything in one place, a single pane of glass for service health." "Give us predictions — if we don't act now, when will it fail?"

CS Performance · Persona Dossier

CSP · Staff / Principal Technical Support Engineer · TechSupport Backend · Global Team · Senior (8–14 years)

StatsNow · RCC · Grafana · NOW Support (TNG) · CLI / SSH · Ruckus

Primary goals

Reduce MTTR — improve first-time resolution through better context and deterministic guidance
Self-explanatory systems — make observability intelligent, minimising manual correlation
Less noise — reduce alert volume and redundant tool-hopping
Enable SAMs to make context-rich case submissions, reducing unnecessary escalations

Friction points

Alert enrichments appear as raw JSON — "why it breached" is unclear to most users
Verification hop between StatsNow/Grafana — choosing the right window is hard
Dashboards don't highlight which metrics matter now; multiple symptoms create ambiguity
Isolated case investigations — limited visibility into pattern recurrences across customers

AI-native opportunities

Auto-suggest relevant timeframes and metrics using anomaly clustering
Conversational RCA: "Why did this alert trigger?" → AI summarises root cause and linked incidents
Pattern recognition on stack traces or queries → auto-suggest related PRBs or defects
Role-based summaries: SAM = overview, Engineer = detail + links

"Show me the cause in one glance, the window I should inspect, and the next thing to do — then let me go deep if I need to."

TDO · Persona Dossier

TDO · Senior Manager, Site Reliability Engineering Management 1500 GCS Cloud Operations · Dublin, IRE (Dawson) · 8–15 years

StatsNow (Cloud Observer) · Eyrie (Alerts & Tasks) · SRE Handover App · Microsoft Teams · Splunk · Bridge Manager

Primary goals

Minimise customer impact — shorten detect → resolve cycle
Situational awareness — accurate, timely status for execs without constant pings
Reduce context-switching across tools
Prevent incidents via capacity hygiene and trend visibility

Friction points

Heavy tool-hopping: Eyrie ↔ Teams ↔ Handover ↔ Cloud Observer ↔ Splunk ↔ Bridge email
Eyrie alert latency (~6–7 min) and occasional missed clears
Bridge comms live in email — execs ping TDOs for status
No unified "one-page" workspace tailored by role

AI-native opportunities

Cross-signal anomaly detection + change correlation; predict blast radius
Live Bridge panel (timeline, join, comms) inside product
Guardrailed auto-remediation for low-risk runbooks (restart/drain/scale)
Incident narrative and post-incident summaries; suggested next steps

"There's a lot of clicking and bouncing around to get what we need." "The goal would be to bring them all together... one page."

04 · Strategy

The Strategic Reframe

Research didn't just surface pain points — it clarified the real product opportunity.

The gap wasn't in data quality or alert accuracy. It was in the experience between alert and understanding. I reframed the entire design strategy around one shift: from monitoring-first to investigation-first.

I established four design principles to make that concrete — not as documentation, but as decision filters used in every review from that point forward.

The core shift

Monitoring-first → Investigation-first

Display data → Answer questions

Historical context first → Current window first

One view for all → Persona-adaptive narratives

Opaque AI conclusions → Explainable AI reasoning

Four principles · every decision

01 · Investigation before visualisation

Every element must accelerate a decision

Every element should help a user answer a question, not just display data. If a visualisation didn't accelerate an investigation decision, it didn't belong on the screen.

02 · Time as context

Sequences and causality must be legible at a glance

Sequences and relationships matter more than isolated data points. Event ordering and causality needed to be legible at a glance — not reconstructed mentally by the engineer.

03 · Progressive disclosure

Detail on demand, not by default

The same event correlation data needed to render at different cognitive depths depending on who was looking at it. Detail on demand — not forced on every user at every moment.

04 · Explainable AI

Every AI output must show its reasoning

What did I find? Why are these events related? What should you look at next? No opaque conclusions: every AI annotation linked back to the raw signals that generated it.

The platform was already generating AI correlations. Engineers were ignoring them — not because they were wrong, but because there was no way to verify them. The design problem wasn't AI capability. It was AI legibility.

— Research synthesis · Design strategy framing

05 · Decision

The Key Design Decision

The hardest call: who initiates the investigation?

Before a screen was designed, the team faced a foundational question that shaped every interaction pattern that followed. Early stakeholder position: let the AI drive. I pushed back — with the research.

❌ Early stakeholder position

AI-initiated

Let the AI drive. Surface proactive correlations automatically. Push insights before engineers even begin investigating. The AI decides what matters first — engineers receive conclusions, not discoveries.

Not trusted · Felt conclusive · Engineers disengaged

✓ Research-backed decision

User-initiated + AI-accelerated

User-triggered investigation workflows with AI-accelerated correlation layered on top. Proactive AI summaries are available as an optional entry point, never a mandatory one. Engineers remain in control of the investigation frame at all times.

Engineers in control · AI as accelerator · Trust earned

The resolution · Hybrid model

User-triggered investigation workflows with AI-accelerated correlation

The incident shadowing had shown that when engineers didn't initiate an investigation themselves, they didn't trust its frame. AI-surfaced correlations felt like conclusions handed to them rather than discoveries they could verify. There was a learned skepticism toward automated root-cause analysis in the SRE community that couldn't be engineered away — it had to be designed around.

This wasn't a compromise. It was the right answer — and it required aligning design, PM, engineering, and observability architects who all had different initial instincts about where the AI should sit.

What the hybrid model meant in practice

🎯

Engineer triggers the frame

Investigation starts when the user opens a correlation workflow from an alert. They choose the starting point — not the AI. The frame of investigation is always engineer-owned.

⚡

AI accelerates within that frame

Once initiated, AI-annotated correlation chains surface related events, ranked by relevance — giving engineers 10× investigative reach without any loss of control or verification ability.

📋

Proactive summaries are optional

AI-generated summaries are available as an optional entry point. They are genuinely useful for SAMs and CSEs, but never forced on SREs who prefer to start from raw signals.
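The hybrid model can be sketched in a few lines of Python. This is an illustration only, not the shipped implementation: the Event fields, the ten-minute window, and the relevance heuristic (same service first, then closeness in time) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """A hypothetical telemetry event; field names are illustrative."""
    id: str
    service: str
    kind: str          # e.g. "alert", "deploy", "log"
    at: datetime


def open_investigation(alert: Event, events: list[Event],
                       window: timedelta = timedelta(minutes=10)) -> list[Event]:
    """Engineer-triggered: runs only when a user opens an alert.

    The chosen alert sets the frame (a time window around it); the
    ranking inside that frame stands in for the AI layer.
    """
    in_frame = [e for e in events
                if e.id != alert.id and abs(e.at - alert.at) <= window]

    def relevance(e: Event) -> tuple[int, timedelta]:
        # Assumed heuristic: same-service events first, then nearest in time.
        return (0 if e.service == alert.service else 1, abs(e.at - alert.at))

    return sorted(in_frame, key=relevance)


# Made-up sample data for the sketch.
t0 = datetime(2024, 1, 1, 12, 0)
alert = Event("a1", "db", "alert", t0)
events = [
    alert,
    Event("e1", "api", "log", t0 - timedelta(minutes=2)),
    Event("e2", "db", "deploy", t0 - timedelta(minutes=5)),
    Event("e3", "db", "log", t0 - timedelta(hours=3)),   # outside the frame
]
ranked = open_investigation(alert, events)
print([e.id for e in ranked])  # ['e2', 'e1']
```

The point of the shape is that nothing is ranked until the engineer picks the alert, and nothing outside the engineer's frame is surfaced.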

06b · Design Iterations

From hypothesis to shipped experience

To accelerate alignment, the PM team used AI-assisted ideation to create exploratory operational scenarios — not designs, but hypotheses used to validate workflows against research findings.

My role was to challenge each concept against real investigation behaviors — testing progressive disclosure, AI explainability, and differences between SRE and ops workflows. Early concepts failed because they surfaced conclusions without evidence trails and organised information by signal type instead of investigation timeline. These failures clarified what the experience needed to become.

⚡ Key debate · Information hierarchy

Raw telemetry first vs. correlation narrative first

Engineering stakeholders preferred exposing raw telemetry first, with AI correlations as secondary context. Research and incident shadowing showed something different: engineers use raw signals to validate hypotheses — not create them.

Resolution: a narrative-first approach, in which AI correlation leads and raw telemetry is revealed on demand. I advocated for it on the strength of the research evidence, and it prevailed. It was validated in V3 and in the final shipped experience.

Iteration Timeline

Early concepts: hypothesis validation

Timeline-based investigation flow

Persona-adaptive · Narrative-first · Shipped

Early concepts — hypothesis validation

AI-assisted ideation was used to explore workflows against research. Concepts organised events by signal type: alerts grouped with alerts, logs with logs. They failed because they mirrored the existing tool-switching problem. There were no evidence trails behind AI conclusions, so engineers couldn't verify them. These failures clarified what the experience needed to become.

Failed · signal-type org · Learnings captured

Not shipped

V1 · Alert correlation overview — signal-type grouping

V1 · Metrics drill-down — no causal chain visible

Timeline-based investigation flow

Established a chronological event timeline with AI-generated relationship annotations — causal arrows showing why signals were grouped. Engineers could read the incident as a narrative rather than a list of disconnected signals. Reviewed with design leadership and observability architects. Validated the timeline-first mental model.

Timeline-first · AI annotations added

Iteration

V2 · Timeline investigation — AI anomaly signal panel

V2 · AI insight drawer — root cause chain with context

Persona-adaptive views · Narrative-first · Shipped

Introduced persona-adaptive views: deep telemetry narrative for SREs, action-oriented briefings for SAMs and CSEs. Elevated AI investigation summaries into the primary entry point. Adopted a narrative-first approach: AI correlation leads, raw telemetry on demand. Validated with product stakeholders and observability architects. MTTR reduced from 14 to 3 minutes. 92% adoption at launch.

Shipped · Research-validated · Persona-adaptive

Shipped

Final · Full dashboard — AI-native event correlation

Final · Instance Performance overview — Alerts, Diagnostics

Final · AI insight panel — root cause chain & recommended actions

Final · Causal chain reconstruction — 8-step propagation timeline

07 · Final Experience

The Final Experience

A purpose-built investigation experience layered onto the Instance Performance platform

The final design introduced a dedicated Event Correlation workflow — accessible from any alert, anomaly, or degradation signal. Engineers could move from an alert notification directly into a timeline of related events, with AI-annotated relationship chains showing why signals were grouped.

How the experience worked end-to-end

Alert → Correlation timeline in one click

Engineers move directly from an alert notification into a timeline of related events. No tab-switching. No manual reconstruction. The investigation starts the moment the alert is opened — context already assembled.

AI-annotated relationship chains

Every event group shows the AI's reasoning: what signals are correlated, confidence level, and the raw evidence behind the conclusion. Engineers can expand to verify or dismiss with a single click — always in control.

Role-calibrated AI summaries

AI-generated summaries adapt to the user's role context — deep telemetry narrative for SREs, action-oriented briefings for SAMs and CSEs. Same underlying data, different cognitive depth per persona.

Raw telemetry on demand

Any individual signal can be drilled into for full logs, metrics, and traces. Full depth always available — never forced upfront. Detail on demand, not by default. Engineers decide when to go deeper.

Confidence levels & evidence trails

Every AI annotation links back to the raw signals that generated it. Correlation chains could be expanded or dismissed. Summaries showed confidence scores. Every claim was verifiable — trust built through transparency.
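One way to picture the evidence-trail rule is as a data shape in which an AI annotation cannot exist without the raw signals behind it. The Python sketch below is hypothetical: the class names, fields, and sample values are assumptions for illustration and do not describe the actual ServiceNow data model.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """A raw signal (log line, metric sample, trace span); hypothetical shape."""
    id: str
    source: str
    summary: str


@dataclass
class CorrelationAnnotation:
    """One AI annotation: a claim, its confidence, and the evidence behind it."""
    claim: str
    confidence: float                 # 0.0 to 1.0, shown next to the claim
    evidence: list[Signal] = field(default_factory=list)
    dismissed: bool = False

    def __post_init__(self) -> None:
        # An annotation with no evidence trail is the opaque conclusion
        # the design avoids, so it is rejected at construction time.
        if not self.evidence:
            raise ValueError("annotation must link to at least one raw signal")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def expand(self) -> list[str]:
        """What the engineer sees on expanding the annotation."""
        return [f"{s.source}: {s.summary}" for s in self.evidence]

    def dismiss(self) -> None:
        """The engineer can reject the correlation with one action."""
        self.dismissed = True


# Made-up sample annotation for the sketch.
note = CorrelationAnnotation(
    claim="Latency spike follows connection-pool exhaustion",
    confidence=0.82,
    evidence=[
        Signal("m1", "metrics", "db pool usage at 100%"),
        Signal("l7", "logs", "timeout acquiring connection"),
    ],
)
print(f"{note.claim} ({note.confidence:.0%})")
for line in note.expand():
    print("  ", line)
```

Making the evidence list mandatory in the structure itself, rather than optional in the UI, is one way to keep every claim verifiable by construction.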

Persona-specific views

SRE · Site Reliability Engineer

Deep telemetry narrative

Full event timeline with causal chain

AI correlation with expandable evidence

Raw logs, metrics & traces on demand

Before / during / after comparison view

SAM · Support Account Manager

Action-oriented summary

Plain-language incident summary

Customer impact scope and duration

Current status & escalation state

Recommended client communication

CSE · Customer Support Engineer

Guided triage flow

AI-suggested root cause candidates

Similar past incident references

Next steps & escalation triggers

Confidence indicators on AI outputs
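The persona views above can be read as three projections of one shared incident record. The Python sketch below illustrates that idea under stated assumptions: the Incident fields, the Role enum, and the render function are hypothetical names chosen for the example, and the sample values are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SRE = "sre"
    SAM = "sam"
    CSE = "cse"


@dataclass
class Incident:
    """Hypothetical shared record behind every persona view."""
    summary: str
    customer_impact: str
    status: str
    root_cause_candidates: list[str]
    next_steps: list[str]
    timeline: list[str]


def render(incident: Incident, role: Role) -> dict[str, object]:
    """Select fields by role; the role comes from context, not a user toggle."""
    if role is Role.SRE:
        # Deep telemetry narrative: timeline plus correlation detail.
        return {"timeline": incident.timeline,
                "root_cause_candidates": incident.root_cause_candidates}
    if role is Role.SAM:
        # Action-oriented summary for client communication.
        return {"summary": incident.summary,
                "customer_impact": incident.customer_impact,
                "status": incident.status}
    # CSE: guided triage flow.
    return {"root_cause_candidates": incident.root_cause_candidates,
            "next_steps": incident.next_steps}


# Made-up sample incident for the sketch.
incident = Incident(
    summary="Checkout latency degraded for 12 minutes",
    customer_impact="One production instance",
    status="Mitigated",
    root_cause_candidates=["Connection-pool exhaustion"],
    next_steps=["Confirm pool size", "Escalate on recurrence"],
    timeline=["12:01 deploy", "12:03 pool at 100%", "12:04 latency alert"],
)
for role in Role:
    print(role.name, sorted(render(incident, role)))
```

Because every view reads from the same record, the three personas never disagree about the facts; they only differ in how much of the record they see first.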

Interaction design · behavioral messages reinforced

🎯

Behavioral message

You are in control

Correlation chains could be expanded or dismissed. Summaries showed confidence levels. Every AI annotation linked back to the raw signals that generated it.

🤖

Behavioral message

The AI is working for you

AI-surfaced correlations felt like discoveries engineers could verify — not conclusions handed to them. The investigation frame was always engineer-owned.

🔐

Behavioral message

Trust is earned incrementally

Confidence levels, expandable reasoning, and evidence trails built trust through transparency — not assertion. Every small interaction reinforced one message.

Impact

14→3

Minutes · Mean Time to Resolution

78%

Reduction in investigation time

92%

Adoption rate at launch

3

Personas served by one adaptive experience

07b · The Design in Action

Prototype Walkthrough

The Design in Action

Explore the AI-native event correlation workflow — from alert trigger to causal chain reconstruction, role-adaptive summaries, and evidence-backed AI reasoning.

🔒 mahenderuxdesigner.com / portfolio / ai-native-event-correlation

08 · Reflection

Reflection

What this project taught me about designing for understanding

The deepest lesson wasn't about AI, or observability, or enterprise SaaS specifically. It was about the difference between a data problem and an understanding problem — and how rarely those two things are the same.

Six things I'm carrying forward

🔍

Data and understanding are not the same thing

The platform had telemetry. Engineers had skills. Signals were firing at scale. Incidents still took 14 minutes to resolve — because no one had designed for the cognitive work between alert and answer.

🧭

The gap lives in what metrics don't capture

That gap lives in a place that monitoring metrics don't capture and feature requests rarely name. Finding it required research that went beyond workflows into mental models.

🤝

AI trust is a design problem, not an engineering one

Users weren't resistant to AI. They were resistant to opacity. The moment reasoning became visible — confidence levels, evidence trails, annotations — skepticism turned to adoption.

🏗️

The decisions before any screen is drawn matter most

The hardest part wasn't the AI layer. It was defending the investigation-first frame long enough for the team to see it validated. That's the work that determined everything downstream.

👥

Personas are architecturally different, not just stylistically

SREs and SAMs don't need different polish on the same information. They need fundamentally different narratives. Progressive disclosure had to work at the data-structure level, not just the UI level.

📐

Principles must be decision filters, not documentation

The four design principles only worked because they were used as active filters in every design review — not written once and filed away. That's what made them real constraints.

Design principles · validated outcomes

Investigation before visualisation

Every screen element tied to an investigation question. Removed 3 visualisation components in V2 review that displayed data without accelerating a decision.

Time as context

Chronological event timeline became the primary navigation model. Engineers could read incident causality without mental reconstruction — critical during high-stress P1 incidents.

Progressive disclosure

Same correlation data rendered at three depths across SRE, SAM, and CSE. Role context determined cognitive depth — not a toggle or preference setting.

Explainable AI

Every AI output linked to the raw signals that generated it. Confidence scores displayed alongside conclusions. Engineers went from ignoring AI to actioning it — 92% adoption in 60 days.

Final thought

"The hardest part of this project wasn't the AI layer. It was defending the investigation-first frame long enough for the team to see it validated."

That's the work I find most meaningful — the decisions made before any screen is drawn. A design strategy that prioritised human reasoning over data completeness. Research that went beyond workflows into mental models. And the discipline to hold a frame under pressure long enough to see it proved right.

Mahender Kommaganti · Sr. UX Designer

ServiceNow · OODP · 2024

6 months · End-to-end ownership