Cross-Domain Correlation: Why Your Observability Stack Can't Find Root Cause

Team Rebase

Your observability tools show symptoms but can't explain root causes because they don't connect code changes, infrastructure metrics, and incident history across domains.

Your Datadog dashboard shows latency spikes. PagerDuty fires alerts. Splunk logs show errors. GitHub shows a deployment 20 minutes ago. Kubernetes metrics look normal.

Four tools. Four data sources. Zero understanding of how they connect.

This is the fundamental problem with modern incident response: your tools don't talk to each other. Each observability platform, incident manager, and code repository captures valuable signals, but connecting those signals requires a human engineer manually correlating timestamps, service names, and deployment IDs across browser tabs.

That's not root cause analysis. That's archaeology.

What Cross-Domain Correlation Actually Means

Cross-domain correlation is the ability to automatically connect signals across fundamentally different data types and systems: code changes, infrastructure metrics, application traces, deployment events, and incident history.

The "domains" aren't just different tools. They're different types of knowledge:

Domain         | Data Types                                         | Typical Sources
Code           | Commits, PRs, file changes, complexity metrics     | GitHub, GitLab, Bitbucket
Infrastructure | CPU, memory, network, capacity, scaling events     | AWS CloudWatch, Prometheus, cloud APIs
Application    | Traces, spans, latency, error rates, dependencies  | Datadog APM, New Relic, Jaeger
Operations     | Alerts, incidents, on-call history, runbooks       | PagerDuty, Opsgenie, incident management
Deployment     | Releases, rollbacks, feature flags, config changes | ArgoCD, Spinnaker, LaunchDarkly

Most "unified observability" platforms claim to correlate data. What they actually do is aggregate it in one UI. You still manually connect the dots.

Datadog's approach: unified service tagging (env, service, version) enables correlation within their platform. If your traces and logs share the same tags, you can pivot between them. But Datadog doesn't know what code changed in that deployment, who approved it, or whether similar changes caused incidents before.

Splunk's approach: a common data platform that ingests everything. Powerful for ad-hoc queries, but correlation requires writing SPL queries that assume you already know what you're looking for.

BigPanda's approach: event correlation that clusters alerts by topology, time, and historical patterns. Excellent for reducing alert noise. But they operate purely in the operations domain with no code context, no infrastructure capacity awareness, no deployment history.

These are valuable tools. They're not solving cross-domain correlation.

The Problem Isn't Data. It's Relationships

Here's an incident that took 4 hours to diagnose:

Symptom: Payment API p95 latency jumped from 120ms to 850ms at 14:23 UTC.

What the tools showed:

  • Datadog: Latency spike on payment-service, traces show slow database queries

  • PagerDuty: Alert fired at 14:25 UTC, on-call engineer paged

  • AWS CloudWatch: RDS connection count elevated, CPU normal

  • Kubernetes: No pod restarts, no OOM kills

  • GitHub: 3 deployments in past 2 hours to different services

What actually happened: A deployment to user-preferences (not payment-service) introduced an N+1 query pattern that exhausted the shared database connection pool. Payment service latency was a downstream effect, not the root cause.

The engineer spent 3 hours investigating the payment service because that's where the alert fired. The connection to the user-preferences deployment required:

  1. Noticing that database connections spiked before the latency alert

  2. Checking which services share that database (tribal knowledge, not in any tool)

  3. Correlating deployment times across multiple services

  4. Reviewing the actual code changes in each deployment

  5. Recognizing the N+1 pattern in the new code

This isn't a tooling failure; each tool did its job. It's a relationship failure. No system understood:

  • Which services share infrastructure dependencies

  • How code changes map to production behavior

  • Which deployment changed what code paths

  • Historical patterns of similar failures

How Cross-Domain Correlation Actually Works

Real cross-domain correlation requires three capabilities that most platforms lack:

1. A Semantic Model of Your System

You need a graph, not a dashboard, that represents relationships:

user-preferences-service
  ├── DEPLOYED_BY → deploy-7f8a2c (14:15 UTC)
  │     └── CHANGED → UserPreferences.loadAll() [+47 lines]
  ├── SHARES_DATABASE_WITH → payment-service
  └── PAST_INCIDENTS → prior database-related incidents

This isn't just service discovery. It's encoding the relationships that matter for incident investigation: code ownership, infrastructure dependencies, deployment lineage, and historical incident patterns.
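
To make this concrete, here is a minimal sketch of such a relationship graph as a tiny in-memory Python structure. The entity and relation names are illustrative assumptions, not Rebase's actual schema, and a real system would persist the graph and continuously update it from each source tool.

# Illustrative only: a tiny in-memory graph of typed relationships between
# services, deployments, and code. Entity and relation names are hypothetical.
from collections import defaultdict

class SystemGraph:
    def __init__(self):
        # (source entity, relation) -> list of target entities
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        self.edges[(source, relation)].append(target)

    def neighbors(self, source, relation):
        return self.edges[(source, relation)]

graph = SystemGraph()
graph.relate("user-preferences-service", "DEPLOYED_BY", "deploy-7f8a2c")
graph.relate("deploy-7f8a2c", "CHANGED", "UserPreferences.loadAll()")
graph.relate("payment-service", "SHARES_DATABASE_WITH", "user-preferences-service")

# An investigation starts from the symptomatic service and walks relationships:
# which services share its infrastructure, and what did they just deploy?
for suspect in graph.neighbors("payment-service", "SHARES_DATABASE_WITH"):
    print(suspect, "->", graph.neighbors(suspect, "DEPLOYED_BY"))
# user-preferences-service -> ['deploy-7f8a2c']

The investigation pattern is the point: traverse relationships instead of eyeballing dashboards.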

2. Temporal Correlation with Causality Awareness

Most correlation is time-based: "these events happened around the same time, so they're probably related."

That's necessary but insufficient. Time correlation produces false positives when:

  • Unrelated deployments happen during the same deployment window

  • Multiple alerts fire for a single root cause (symptom correlation, not cause correlation)

  • Periodic jobs trigger around incident time

Better correlation requires causal reasoning: not just "what happened at the same time?" but "what could have caused this behavior?"

For the payment service incident:

  • Time correlation would flag all 3 recent deployments equally

  • Causal correlation would prioritize the user-preferences deployment because:

    • It touched code that interacts with the shared database

    • The deployment completed before the latency spike (correct temporal ordering for causation)

    • The code change pattern (new loops with database calls) is a known N+1 risk

    • Historical incidents involving this service were also database-related

This is where most "AI-powered" RCA tools fall short. They correlate signals without modeling causation. Finding correlated events is easy. Understanding which correlation implies causation requires domain knowledge encoded in the system.
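
To illustrate the difference, here is a minimal sketch of causal rather than purely temporal ranking of the candidate deployments from the payment incident. The features, weights, and the search-indexer service name are assumptions made for the example; a real model would draw these signals from the relationship graph described above.

# Illustrative sketch: rank candidate deployments by plausible causation,
# not just temporal proximity. Features, weights, and the "search-indexer"
# service are hypothetical.
from dataclasses import dataclass

@dataclass
class Deployment:
    service: str
    completed_before_symptom: bool   # correct temporal ordering for causation
    touches_shared_dependency: bool  # e.g. the database the symptomatic service uses
    risky_change_pattern: bool       # e.g. a new loop around database calls (N+1 risk)
    similar_past_incidents: int      # prior incidents tied to this service or dependency

def causal_score(d: Deployment) -> float:
    if not d.completed_before_symptom:
        return 0.0  # finished after the symptom started, so it cannot be the cause
    return (3.0 * d.touches_shared_dependency
            + 2.0 * d.risky_change_pattern
            + 1.0 * min(d.similar_past_incidents, 3))

candidates = [
    Deployment("user-preferences", True, True, True, 2),
    Deployment("marketing-site", True, False, False, 0),
    Deployment("search-indexer", False, False, False, 1),
]
print([d.service for d in sorted(candidates, key=causal_score, reverse=True)])
# ['user-preferences', 'marketing-site', 'search-indexer']

Pure time correlation would have treated all three deployments as equally suspicious.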

3. Cross-Tool Data Synthesis

The hardest part isn't the algorithm. It's getting the data.

Correlating a GitHub commit to a Datadog latency spike requires:

  • Extracting commit SHAs from deployment events

  • Mapping deployments to Kubernetes pods to service endpoints

  • Connecting service endpoints to APM traces

  • Normalizing timestamps across systems with different clock skews

  • Handling the semantic gap between "file changed" and "code path executed"

Most observability platforms punt on this. They'll ingest your data, but you're responsible for the mapping. That's why Datadog's correlation works best when everything is already instrumented with their agent and tagged with their conventions.

Cross-domain correlation across heterogeneous tools requires an integration layer that (a minimal sketch follows this list):

  • Normalizes entity identities across systems (service A in GitHub = service-a in Kubernetes = ServiceA in Datadog)

  • Maintains temporal consistency despite clock drift

  • Infers relationships that aren't explicitly declared
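
Here is a minimal sketch of the first two requirements, identity normalization and temporal consistency, assuming a hand-maintained alias table and per-source clock-skew estimates (both hypothetical):

# Illustrative sketch: normalize entity identities and timestamps across tools.
# The alias table, canonical names, and skew values are hypothetical.
from datetime import datetime, timedelta, timezone

# Map each tool's name for a service onto one canonical identity.
ALIASES = {
    ("github", "service-A"): "service-a",
    ("kubernetes", "service-a"): "service-a",
    ("datadog", "ServiceA"): "service-a",
}

def canonical_service(tool: str, name: str) -> str:
    # Fall back to a lowercased, hyphen-normalized guess when no alias is registered.
    return ALIASES.get((tool, name), name.lower().replace("_", "-"))

# Per-source clock-skew corrections, measured against a reference clock.
CLOCK_SKEW = {"ci-runner": timedelta(seconds=-7)}

def normalized_time(source: str, ts: datetime) -> datetime:
    # Convert to UTC, then correct for the source's known drift.
    return ts.astimezone(timezone.utc) + CLOCK_SKEW.get(source, timedelta(0))

print(canonical_service("datadog", "ServiceA"))  # service-a

The third requirement, inferring relationships that nobody declared, is the harder problem; it is exactly where the shared-database link in the earlier incident would have to come from.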

Why Observability Platforms Don't Solve This

Datadog, Splunk, and New Relic are excellent at what they do. They're not designed for cross-domain correlation because:

Their data models are metric/trace/log-centric, not entity-centric. They know about hosts, services, and spans. They don't model code changes, team ownership, deployment pipelines, or architectural dependencies as first-class entities.

Their correlation is within-platform, not across-platform. Datadog correlates traces to logs beautifully, if both are in Datadog. Correlating Datadog traces to GitHub commits to PagerDuty incidents requires custom integration work that most teams never build.

They're optimized for real-time monitoring, not historical pattern recognition. When an incident happens, you need to know: "Has this happened before? What fixed it last time? Which code changes correlate with this failure pattern historically?" That requires long-term storage and analysis of incident/deployment/code relationships, not something observability platforms prioritize.

They treat code as external context, not core data. Code changes are the most common root cause of production incidents. But observability platforms treat code as annotation ("deployment happened at X") rather than analyzable data ("this deployment changed 3 files touching the authentication flow, similar to the change that caused the March incident").

The Difference Between Correlation and Intelligence

Most AIOps tools stop at correlation. They cluster related alerts, reduce noise, and surface probable root causes based on temporal and topological proximity.

That's valuable. It's not sufficient.

Consider what an experienced SRE does that correlation can't:

  1. Recognizes patterns across incidents: "This looks like the connection pool exhaustion we had in March"

  2. Weighs evidence by relevance: "The deployment to user-preferences matters more than the marketing-site deployment because it touches shared infrastructure"

  3. Considers counterfactuals: "If the user-preferences deployment hadn't happened, would we still see this latency?"

  4. Incorporates organizational context: "Team Platform deployed this; they had 3 rollbacks last month. Let's check their changes first"

This is engineering intelligence, not just correlation. It requires:

  • Historical incident memory (what happened before?)

  • Code-level understanding (what did this change actually do?)

  • Architectural awareness (how do components interact?)

  • Team and process context (who did what, and is that unusual?)

What We're Building at Rebase

We're building an Engineering Intelligence Platform specifically because we saw this gap.

Rebase maintains a knowledge graph of your engineering system: services, dependencies, code ownership, deployment history, infrastructure topology, incident patterns, and team structure. When something breaks, we don't just correlate timestamps. We reason about causation:

For the payment service incident, Rebase would surface the user-preferences deployment as the leading suspect, along with the reasoning behind it: the deployment completed minutes before the latency spike, it changed code that talks to the database shared with payment-service, the new code introduced a loop of database calls (the classic N+1 pattern), and past incidents involving this service were also database-related.

This isn't magic. It's encoding the relationships and reasoning patterns that experienced SREs use, then applying them automatically across every incident.

The Broader Vision: Beyond Incident Response

Cross-domain correlation for incidents is the starting point. The same capability enables:

Proactive risk detection (sketched after these examples): "This PR touches code paths similar to 3 past incidents. The deployment window is Friday afternoon. Infrastructure is at 82% capacity. Recommend Monday deployment with staged rollout."

Architectural intelligence: "Services touching the auth layer have 3x higher incident rates. 5 teams own code in this module with no clear ownership boundary."

Technical debt quantification: "This legacy module correlates with 40% of P0 incidents. Refactoring effort: 6 weeks. Projected incident reduction: 60%."

Team effectiveness insights: "Team A's deployments have 4x rollback rate. Root cause: inadequate infrastructure context during code review. Team A works on high-scale services but reviews don't check capacity."
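
The proactive risk detection example can be made concrete with a minimal sketch of a pre-deployment check; the thresholds, field names, and recommendation wording are assumptions for illustration.

# Illustrative sketch of a pre-deployment risk check built on the same
# cross-domain data. Thresholds and recommendation text are hypothetical.
from dataclasses import dataclass

@dataclass
class DeploymentPlan:
    similar_past_incidents: int   # past incidents touching the same code paths
    is_friday_afternoon: bool     # risky deployment window
    infra_capacity_used: float    # 0.0 .. 1.0

def risk_recommendation(plan: DeploymentPlan) -> str:
    risky = (plan.similar_past_incidents >= 2
             or plan.infra_capacity_used >= 0.8
             or plan.is_friday_afternoon)
    if risky:
        return "Defer to a lower-risk window and use a staged rollout"
    return "Proceed with the standard rollout"

print(risk_recommendation(DeploymentPlan(3, True, 0.82)))
# Defer to a lower-risk window and use a staged rollout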

These insights require the same foundation: a semantic model of your system, cross-domain data synthesis, and causal reasoning rather than mere correlation.

The tools we have are excellent at their individual jobs. Datadog for observability. PagerDuty for incident response. GitHub for code. Kubernetes for orchestration.

What's missing is the intelligence layer that connects them, that understands not just what's happening, but why, and what you should do about it.

That's what cross-domain correlation makes possible. That's what we're building.