Cross-Domain Correlation: Why Your Observability Stack Can't Find Root Cause

Team Rebase

Your observability tools show symptoms but can't explain root causes because they don't connect code changes, infrastructure metrics, and incident history across domains.

Your Datadog dashboard shows latency spikes. PagerDuty fires alerts. Splunk logs show errors. GitHub shows a deployment 20 minutes ago. Kubernetes metrics look normal.

Four tools. Four data sources. Zero understanding of how they connect.

This is the fundamental problem with modern incident response: your tools don't talk to each other. Each observability platform, incident manager, and code repository captures valuable signals, but connecting those signals requires a human engineer manually correlating timestamps, service names, and deployment IDs across browser tabs.

That's not root cause analysis. That's archaeology.

What Cross-Domain Correlation Actually Means

Cross-domain correlation is the ability to automatically connect signals across fundamentally different data types and systems: code changes, infrastructure metrics, application traces, deployment events, and incident history.

The "domains" aren't just different tools. They're different types of knowledge:

Domain         | Data Types                                         | Typical Sources
Code           | Commits, PRs, file changes, complexity metrics     | GitHub, GitLab, Bitbucket
Infrastructure | CPU, memory, network, capacity, scaling events     | AWS CloudWatch, Prometheus, cloud APIs
Application    | Traces, spans, latency, error rates, dependencies  | Datadog APM, New Relic, Jaeger
Operations     | Alerts, incidents, on-call history, runbooks       | PagerDuty, Opsgenie, incident management
Deployment     | Releases, rollbacks, feature flags, config changes | ArgoCD, Spinnaker, LaunchDarkly

Most "unified observability" platforms claim to correlate data. What they actually do is aggregate it in one UI. You still manually connect the dots.

Datadog's approach: unified service tagging (env, service, version) enables correlation within their platform. If your traces and logs share the same tags, you can pivot between them. But Datadog doesn't know what code changed in that deployment, who approved it, or whether similar changes caused incidents before.

Splunk's approach: a common data platform that ingests everything. Powerful for ad-hoc queries, but correlation requires writing SPL queries that assume you already know what you're looking for.

BigPanda's approach: event correlation that clusters alerts by topology, time, and historical patterns. Excellent for reducing alert noise. But they operate purely in the operations domain with no code context, no infrastructure capacity awareness, no deployment history.

These are valuable tools. They're not solving cross-domain correlation.

The Problem Isn't Data. It's Relationships

Here's an incident that took 4 hours to diagnose:

Symptom: Payment API p95 latency jumped from 120ms to 850ms at 14:23 UTC.

What the tools showed:

  • Datadog: Latency spike on payment-service, traces show slow database queries

  • PagerDuty: Alert fired at 14:25 UTC, on-call engineer paged

  • AWS CloudWatch: RDS connection count elevated, CPU normal

  • Kubernetes: No pod restarts, no OOM kills

  • GitHub: 3 deployments in past 2 hours to different services

What actually happened: A deployment to user-preferences (not payment-service) introduced an N+1 query pattern that exhausted the shared database connection pool. Payment service latency was a downstream effect, not the root cause.

The engineer spent 3 hours investigating the payment service because that's where the alert fired. The connection to the user-preferences deployment required:

  1. Noticing that database connections spiked before the latency alert

  2. Checking which services share that database (tribal knowledge, not in any tool)

  3. Correlating deployment times across multiple services

  4. Reviewing the actual code changes in each deployment

  5. Recognizing the N+1 pattern in the new code

This isn't a tooling failure; each tool did its job. It's a relationship failure. No system understood:

  • Which services share infrastructure dependencies

  • How code changes map to production behavior

  • Which deployment changed what code paths

  • Historical patterns of similar failures

How Cross-Domain Correlation Actually Works

Real cross-domain correlation requires three capabilities that most platforms lack:

1. A Semantic Model of Your System

You need a graph, not a dashboard, that represents relationships:

user-preferences-service
  ├── DEPLOYED_BY → deploy-7f8a2c (14:15 UTC)
  │     └── CHANGED → UserPreferences.loadAll() [+47 lines]
  ├── SHARES_DATABASE_WITH → payment-service
  └── PAST_INCIDENTS → prior database-related incidents

This isn't just service discovery. It's encoding the relationships that matter for incident investigation: code ownership, infrastructure dependencies, deployment lineage, and historical incident patterns.
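
To make this concrete, here is a minimal sketch of such a relationship graph as a tiny in-memory Python structure. The entity and relation names are illustrative assumptions, not Rebase's actual schema, and a real system would persist the graph and continuously update it from each source tool.

# Illustrative only: a tiny in-memory graph of typed relationships between
# services, deployments, and code. Entity and relation names are hypothetical.
from collections import defaultdict

class SystemGraph:
    def __init__(self):
        # (source entity, relation) -> list of target entities
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        self.edges[(source, relation)].append(target)

    def neighbors(self, source, relation):
        return self.edges[(source, relation)]

graph = SystemGraph()
graph.relate("user-preferences-service", "DEPLOYED_BY", "deploy-7f8a2c")
graph.relate("deploy-7f8a2c", "CHANGED", "UserPreferences.loadAll()")
graph.relate("payment-service", "SHARES_DATABASE_WITH", "user-preferences-service")

# An investigation starts from the symptomatic service and walks relationships:
# which services share its infrastructure, and what did they just deploy?
for suspect in graph.neighbors("payment-service", "SHARES_DATABASE_WITH"):
    print(suspect, "->", graph.neighbors(suspect, "DEPLOYED_BY"))
# user-preferences-service -> ['deploy-7f8a2c']

The investigation pattern is the point: traverse relationships instead of eyeballing dashboards.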

2. Temporal Correlation with Causality Awareness

Most correlation is time-based: "these events happened around the same time, so they're probably related."

That's necessary but insufficient. Time correlation produces false positives when:

  • Unrelated deployments happen during the same deployment window

  • Multiple alerts fire for a single root cause (symptom correlation, not cause correlation)

  • Periodic jobs trigger around incident time

Better correlation requires causal reasoning: not just "what happened at the same time?" but "what could have caused this behavior?"

For the payment service incident:

  • Time correlation would flag all 3 recent deployments equally

  • Causal correlation would prioritize the user-preferences deployment because:

    • It touched code that interacts with the shared database

    • The deployment completed before the latency spike (correct temporal ordering for causation)

    • The code change pattern (new loops with database calls) is a known N+1 risk

    • Historical incidents involving this service were also database-related

This is where most "AI-powered" RCA tools fall short. They correlate signals without modeling causation. Finding correlated events is easy. Understanding which correlation implies causation requires domain knowledge encoded in the system.
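
To illustrate the difference, here is a minimal sketch of causal rather than purely temporal ranking of the candidate deployments from the payment incident. The features, weights, and the search-indexer service name are assumptions made for the example; a real model would draw these signals from the relationship graph described above.

# Illustrative sketch: rank candidate deployments by plausible causation,
# not just temporal proximity. Features, weights, and the "search-indexer"
# service are hypothetical.
from dataclasses import dataclass

@dataclass
class Deployment:
    service: str
    completed_before_symptom: bool   # correct temporal ordering for causation
    touches_shared_dependency: bool  # e.g. the database the symptomatic service uses
    risky_change_pattern: bool       # e.g. a new loop around database calls (N+1 risk)
    similar_past_incidents: int      # prior incidents tied to this service or dependency

def causal_score(d: Deployment) -> float:
    if not d.completed_before_symptom:
        return 0.0  # finished after the symptom started, so it cannot be the cause
    return (3.0 * d.touches_shared_dependency
            + 2.0 * d.risky_change_pattern
            + 1.0 * min(d.similar_past_incidents, 3))

candidates = [
    Deployment("user-preferences", True, True, True, 2),
    Deployment("marketing-site", True, False, False, 0),
    Deployment("search-indexer", False, False, False, 1),
]
print([d.service for d in sorted(candidates, key=causal_score, reverse=True)])
# ['user-preferences', 'marketing-site', 'search-indexer']

Pure time correlation would have treated all three deployments as equally suspicious.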

3. Cross-Tool Data Synthesis

The hardest part isn't the algorithm. It's getting the data.

Correlating a GitHub commit to a Datadog latency spike requires:

  • Extracting commit SHAs from deployment events

  • Mapping deployments to Kubernetes pods to service endpoints

  • Connecting service endpoints to APM traces

  • Normalizing timestamps across systems with different clock skews

  • Handling the semantic gap between "file changed" and "code path executed"

Most observability platforms punt on this. They'll ingest your data, but you're responsible for the mapping. That's why Datadog's correlation works best when everything is already instrumented with their agent and tagged with their conventions.

Cross-domain correlation across heterogeneous tools requires an integration layer that (a minimal sketch follows this list):

  • Normalizes entity identities across systems (service A in GitHub = service-a in Kubernetes = ServiceA in Datadog)

  • Maintains temporal consistency despite clock drift

  • Infers relationships that aren't explicitly declared
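
Here is a minimal sketch of the first two requirements, identity normalization and temporal consistency, assuming a hand-maintained alias table and per-source clock-skew estimates (both hypothetical):

# Illustrative sketch: normalize entity identities and timestamps across tools.
# The alias table, canonical names, and skew values are hypothetical.
from datetime import datetime, timedelta, timezone

# Map each tool's name for a service onto one canonical identity.
ALIASES = {
    ("github", "service-A"): "service-a",
    ("kubernetes", "service-a"): "service-a",
    ("datadog", "ServiceA"): "service-a",
}

def canonical_service(tool: str, name: str) -> str:
    # Fall back to a lowercased, hyphen-normalized guess when no alias is registered.
    return ALIASES.get((tool, name), name.lower().replace("_", "-"))

# Per-source clock-skew corrections, measured against a reference clock.
CLOCK_SKEW = {"ci-runner": timedelta(seconds=-7)}

def normalized_time(source: str, ts: datetime) -> datetime:
    # Convert to UTC, then correct for the source's known drift.
    return ts.astimezone(timezone.utc) + CLOCK_SKEW.get(source, timedelta(0))

print(canonical_service("datadog", "ServiceA"))  # service-a

The third requirement, inferring relationships that nobody declared, is the harder problem; it is exactly where the shared-database link in the earlier incident would have to come from.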

Why Observability Platforms Don't Solve This

Datadog, Splunk, and New Relic are excellent at what they do. They're not designed for cross-domain correlation because:

Their data models are metric/trace/log-centric, not entity-centric. They know about hosts, services, and spans. They don't model code changes, team ownership, deployment pipelines, or architectural dependencies as first-class entities.

Their correlation is within-platform, not across-platform. Datadog correlates traces to logs beautifully, if both are in Datadog. Correlating Datadog traces to GitHub commits to PagerDuty incidents requires custom integration work that most teams never build.

They're optimized for real-time monitoring, not historical pattern recognition. When an incident happens, you need to know: "Has this happened before? What fixed it last time? Which code changes correlate with this failure pattern historically?" That requires long-term storage and analysis of incident/deployment/code relationships, not something observability platforms prioritize.

They treat code as external context, not core data. Code changes are the most common root cause of production incidents. But observability platforms treat code as annotation ("deployment happened at X") rather than analyzable data ("this deployment changed 3 files touching the authentication flow, similar to the change that caused the March incident").

The Difference Between Correlation and Intelligence

Most AIOps tools stop at correlation. They cluster related alerts, reduce noise, and surface probable root causes based on temporal and topological proximity.

That's valuable. It's not sufficient.

Consider what an experienced SRE does that correlation can't:

  1. Recognizes patterns across incidents: "This looks like the connection pool exhaustion we had in March"

  2. Weighs evidence by relevance: "The deployment to user-preferences matters more than the marketing-site deployment because it touches shared infrastructure"

  3. Considers counterfactuals: "If the user-preferences deployment hadn't happened, would we still see this latency?"

  4. Incorporates organizational context: "Team Platform deployed this; they had 3 rollbacks last month. Let's check their changes first"

This is engineering intelligence, not just correlation. It requires:

  • Historical incident memory (what happened before?)

  • Code-level understanding (what did this change actually do?)

  • Architectural awareness (how do components interact?)

  • Team and process context (who did what, and is that unusual?)

What We're Building at Rebase

We're building an Engineering Intelligence Platform specifically because we saw this gap.

Rebase maintains a knowledge graph of your engineering system: services, dependencies, code ownership, deployment history, infrastructure topology, incident patterns, and team structure. When something breaks, we don't just correlate timestamps. We reason about causation:

For the payment service incident, Rebase would surface the user-preferences deployment as the leading suspect, along with the reasoning behind it: the deployment completed minutes before the latency spike, it changed code that talks to the database shared with payment-service, the new code introduced a loop of database calls (the classic N+1 pattern), and past incidents involving this service were also database-related.

This isn't magic. It's encoding the relationships and reasoning patterns that experienced SREs use, then applying them automatically across every incident.

The Broader Vision: Beyond Incident Response

Cross-domain correlation for incidents is the starting point. The same capability enables:

Proactive risk detection (sketched after these examples): "This PR touches code paths similar to 3 past incidents. The deployment window is Friday afternoon. Infrastructure is at 82% capacity. Recommend Monday deployment with staged rollout."

Architectural intelligence: "Services touching the auth layer have 3x higher incident rates. 5 teams own code in this module with no clear ownership boundary."

Technical debt quantification: "This legacy module correlates with 40% of P0 incidents. Refactoring effort: 6 weeks. Projected incident reduction: 60%."

Team effectiveness insights: "Team A's deployments have 4x rollback rate. Root cause: inadequate infrastructure context during code review. Team A works on high-scale services but reviews don't check capacity."
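
The proactive risk detection example can be made concrete with a minimal sketch of a pre-deployment check; the thresholds, field names, and recommendation wording are assumptions for illustration.

# Illustrative sketch of a pre-deployment risk check built on the same
# cross-domain data. Thresholds and recommendation text are hypothetical.
from dataclasses import dataclass

@dataclass
class DeploymentPlan:
    similar_past_incidents: int   # past incidents touching the same code paths
    is_friday_afternoon: bool     # risky deployment window
    infra_capacity_used: float    # 0.0 .. 1.0

def risk_recommendation(plan: DeploymentPlan) -> str:
    risky = (plan.similar_past_incidents >= 2
             or plan.infra_capacity_used >= 0.8
             or plan.is_friday_afternoon)
    if risky:
        return "Defer to a lower-risk window and use a staged rollout"
    return "Proceed with the standard rollout"

print(risk_recommendation(DeploymentPlan(3, True, 0.82)))
# Defer to a lower-risk window and use a staged rollout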

These insights require the same foundation: a semantic model of your system, cross-domain data synthesis, and causal reasoning rather than mere correlation.

The tools we have are excellent at their individual jobs. Datadog for observability. PagerDuty for incident response. GitHub for code. Kubernetes for orchestration.

What's missing is the intelligence layer that connects them, that understands not just what's happening, but why, and what you should do about it.

That's what cross-domain correlation makes possible. That's what we're building.