Your Datadog dashboard shows latency spikes. PagerDuty fires alerts. Splunk logs show errors. GitHub shows a deployment 20 minutes ago. Kubernetes metrics look normal.
Four tools. Four data sources. Zero understanding of how they connect.
This is the fundamental problem with modern incident response: your tools don't talk to each other. Each observability platform, incident manager, and code repository captures valuable signals, but connecting those signals requires a human engineer manually correlating timestamps, service names, and deployment IDs across browser tabs.
That's not root cause analysis. That's archaeology.
What Cross-Domain Correlation Actually Means
Cross-domain correlation is the ability to automatically connect signals across fundamentally different data types and systems: code changes, infrastructure metrics, application traces, deployment events, and incident history.
The "domains" aren't just different tools. They're different types of knowledge:
| Domain | Data Types | Typical Sources |
|---|---|---|
| Code | Commits, PRs, file changes, complexity metrics | GitHub, GitLab, Bitbucket |
| Infrastructure | CPU, memory, network, capacity, scaling events | AWS CloudWatch, Prometheus, cloud APIs |
| Application | Traces, spans, latency, error rates, dependencies | Datadog APM, New Relic, Jaeger |
| Operations | Alerts, incidents, on-call history, runbooks | PagerDuty, Opsgenie, incident management |
| Deployment | Releases, rollbacks, feature flags, config changes | ArgoCD, Spinnaker, LaunchDarkly |
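To make that concrete, here is a rough sketch in Python (hypothetical types and field names, not any vendor's schema) of what signals from these domains look like side by side:

```python
# Hypothetical sketch: each domain contributes differently shaped signals,
# but they only become useful once they share entity and time references.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Signal:
    domain: str        # "code", "infrastructure", "application", "operations", "deployment"
    source: str        # e.g. "github", "prometheus", "datadog-apm", "pagerduty", "argocd"
    entity: str        # the service or component the signal is about
    timestamp: datetime
    payload: dict      # domain-specific details: commit SHA, metric value, alert id...

signals = [
    Signal("deployment", "argocd", "user-preferences",
           datetime(2025, 11, 19, 14, 10), {"commit_sha": "9f3c2e1"}),
    Signal("application", "datadog-apm", "payment-service",
           datetime(2025, 11, 19, 14, 23), {"p95_ms": 850}),
]

# Sorting by time is trivial; knowing that these two entities share a database is not.
for s in sorted(signals, key=lambda s: s.timestamp):
    print(s.domain, s.entity, s.payload)
```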
Most "unified observability" platforms claim to correlate data. What they actually do is aggregate it in one UI. You still manually connect the dots.
Datadog's approach: unified service tagging (env, service, version) enables correlation within their platform. If your traces and logs share the same tags, you can pivot between them. But Datadog doesn't know what code changed in that deployment, who approved it, or whether similar changes caused incidents before.
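Conceptually, unified tagging is just a shared join key. A toy illustration in plain Python (not Datadog's API, and with made-up values):

```python
# Illustration only: shared (env, service, version) tags act as a join key
# across telemetry types within one platform.
trace = {"trace_id": "abc123", "env": "prod",
         "service": "payment-service", "version": "1.42.0"}
logs = [
    {"msg": "connection pool exhausted", "env": "prod",
     "service": "payment-service", "version": "1.42.0"},
    {"msg": "cache warmed", "env": "prod",
     "service": "user-preferences", "version": "2.7.1"},
]

TAG_KEYS = ("env", "service", "version")
related = [log for log in logs if all(log[k] == trace[k] for k in TAG_KEYS)]
print(related)  # the first log only -- nothing here links back to a commit, PR, or approver
```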
Splunk's approach: a common data platform that ingests everything. Powerful for ad-hoc queries, but correlation requires writing SPL queries that assume you already know what you're looking for.
BigPanda's approach: event correlation that clusters alerts by topology, time, and historical patterns. Excellent for reducing alert noise. But they operate purely in the operations domain with no code context, no infrastructure capacity awareness, no deployment history.
These are valuable tools. They're not solving cross-domain correlation.
The Problem Isn't Data. It's Relationships
Here's an incident that took 4 hours to diagnose:
Symptom: Payment API p95 latency jumped from 120ms to 850ms at 14:23 UTC.
What the tools showed:
Datadog: Latency spike on payment-service, traces show slow database queries
PagerDuty: Alert fired at 14:25 UTC, on-call engineer paged
AWS CloudWatch: RDS connection count elevated, CPU normal
Kubernetes: No pod restarts, no OOM kills
GitHub: 3 deployments in past 2 hours to different services
What actually happened: A deployment to user-preferences (not payment-service) introduced an N+1 query pattern that exhausted the shared database connection pool. Payment service latency was a downstream effect, not the root cause.
The engineer spent 3 hours investigating the payment service because that's where the alert fired. The connection to the user-preferences deployment required:
Noticing that database connections spiked before the latency alert
Checking which services share that database (tribal knowledge, not in any tool)
Correlating deployment times across multiple services
Reviewing the actual code changes in each deployment
Recognizing the N+1 pattern in the new code
This isn't a tooling failure. Each tool did its job. It's a relationship failure. No system understood:
Which services share infrastructure dependencies
How code changes map to production behavior
Which deployment changed what code paths
Historical patterns of similar failures
How Cross-Domain Correlation Actually Works
Real cross-domain correlation requires three capabilities that most platforms lack:
1. A Semantic Model of Your System
You need a graph, not a dashboard, that represents relationships:
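Here is a deliberately small sketch of that idea in plain Python, with made-up entity names and a dict standing in for a real graph store:

```python
# Hypothetical sketch of a system graph: nodes are entities from different
# domains, edges are the relationships that matter during an incident.
from collections import defaultdict

edges = defaultdict(list)

def relate(src, relation, dst):
    edges[src].append((relation, dst))

# Infrastructure dependencies
relate("service:payment-service", "uses_db", "rds:orders-db")
relate("service:user-preferences", "uses_db", "rds:orders-db")   # shared connection pool

# Code ownership and deployment lineage
relate("service:user-preferences", "owned_by", "team:platform")
relate("deploy:2025-11-19T14:10Z", "deployed", "service:user-preferences")
relate("deploy:2025-11-19T14:10Z", "contains_commit", "commit:9f3c2e1")

# Historical incident patterns
relate("incident:2025-03-pool-exhaustion", "involved", "rds:orders-db")

def neighbors(node, relation):
    """Entities reachable from `node` via `relation`."""
    return [dst for rel, dst in edges[node] if rel == relation]

# "Which services share infrastructure with payment-service?" -- the tribal
# knowledge from the earlier incident, now answerable with a two-hop query.
shared = {
    svc
    for db in neighbors("service:payment-service", "uses_db")
    for svc, rels in edges.items()
    for rel, dst in rels
    if rel == "uses_db" and dst == db and svc != "service:payment-service"
}
print(shared)  # {'service:user-preferences'}
```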
This isn't just service discovery. It's encoding the relationships that matter for incident investigation: code ownership, infrastructure dependencies, deployment lineage, and historical incident patterns.
2. Temporal Correlation with Causality Awareness
Most correlation is time-based: "these events happened around the same time, so they're probably related."
That's necessary but insufficient. Time correlation produces false positives when:
Unrelated deployments happen during the same deployment window
Multiple alerts fire for a single root cause (symptom correlation, not cause correlation)
Periodic jobs trigger around incident time
Better correlation requires causal reasoning: not just "what happened at the same time?" but "what could have caused this behavior?"
For the payment service incident:
Time correlation would flag all 3 recent deployments equally
Causal correlation would prioritize the user-preferences deployment because:
It touched code that interacts with the shared database
The deployment completed before the latency spike (correct temporal ordering for causation)
The code change pattern (new loops with database calls) is a known N+1 risk
Historical incidents involving this service were also database-related
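A toy version of that prioritization, with made-up weights and fields, looks something like this:

```python
# Hypothetical scoring heuristic: rank candidate deployments by causal
# plausibility rather than by temporal proximity alone.
from dataclasses import dataclass

@dataclass
class Deployment:
    service: str
    completed_before_spike: bool   # correct temporal ordering for causation
    touches_shared_infra: bool     # e.g. shares the exhausted connection pool
    risky_change_pattern: bool     # e.g. new loop issuing per-item queries (N+1)
    similar_past_incidents: int    # database-related incidents for this service

def causal_score(d: Deployment) -> float:
    if not d.completed_before_spike:
        return 0.0                 # cannot have caused a spike it did not precede
    score = 1.0                    # time correlation alone treats all candidates equally
    score += 2.0 if d.touches_shared_infra else 0.0
    score += 2.0 if d.risky_change_pattern else 0.0
    score += 0.5 * d.similar_past_incidents
    return score

candidates = [
    Deployment("user-preferences", True, True, True, 2),
    Deployment("marketing-site",   True, False, False, 0),
    Deployment("search-indexer",   False, False, False, 1),
]
for d in sorted(candidates, key=causal_score, reverse=True):
    print(f"{d.service}: {causal_score(d):.1f}")
# user-preferences: 6.0, marketing-site: 1.0, search-indexer: 0.0
```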
This is where most "AI-powered" RCA tools fall short. They correlate signals without modeling causation. Finding correlated events is easy. Understanding which correlation implies causation requires domain knowledge encoded in the system.
3. Cross-Tool Data Synthesis
The hardest part isn't the algorithm. It's getting the data.
Correlating a GitHub commit to a Datadog latency spike requires:
Extracting commit SHAs from deployment events
Mapping deployments to Kubernetes pods to service endpoints
Connecting service endpoints to APM traces
Normalizing timestamps across systems with different clock skews
Handling the semantic gap between "file changed" and "code path executed"
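Even a stripped-down version of this stitching, with hypothetical event payloads (real webhook formats vary per tool), shows where the effort goes:

```python
# Hypothetical payload shapes -- the work is stitching identities and
# timestamps together across systems, not the correlation logic itself.
from datetime import datetime, timedelta, timezone

deployment_event = {            # e.g. from a CD pipeline webhook
    "service": "user-preferences",
    "commit_sha": "9f3c2e1",
    "finished_at": "2025-11-19T14:10:05+00:00",
}
apm_anomaly = {                 # e.g. from an APM anomaly monitor
    "service": "user-preferences",
    "metric": "db.connections",
    "detected_at": "2025-11-19T14:21:40+00:00",
}

CLOCK_SKEW = timedelta(seconds=30)   # assumed tolerance between systems' clocks

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts).astimezone(timezone.utc)

deploy_time = parse(deployment_event["finished_at"])
anomaly_time = parse(apm_anomaly["detected_at"])

same_entity = deployment_event["service"] == apm_anomaly["service"]  # naive name match
plausible_order = deploy_time <= anomaly_time + CLOCK_SKEW

if same_entity and plausible_order:
    print(f"commit {deployment_event['commit_sha']} is a candidate cause "
          f"({(anomaly_time - deploy_time).total_seconds():.0f}s before the anomaly)")
```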
Most observability platforms punt on this. They'll ingest your data, but you're responsible for the mapping. That's why Datadog's correlation works best when everything is already instrumented with their agent and tagged with their conventions.
Cross-domain correlation across heterogeneous tools requires an integration layer that:
Normalizes entity identities across systems (service A in GitHub = service-a in Kubernetes = ServiceA in Datadog)
Maintains temporal consistency despite clock drift
Infers relationships that aren't explicitly declared
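Identity normalization alone is its own small project. A naive sketch (hypothetical aliases; a real system needs inference plus human-curated mappings):

```python
# Hypothetical alias resolution: the "same" service is named differently in
# each tool, so correlation first needs a canonical identity.
import re

ALIASES = {
    # canonical id -> names seen in each system
    "service-a": {"github": "Service A", "kubernetes": "service-a", "datadog": "ServiceA"},
}

def normalize(name: str) -> str:
    """Lowercase and strip separators: 'Service A' / 'service-a' / 'ServiceA' -> 'servicea'."""
    return re.sub(r"[\s_\-]+", "", name).lower()

# Reverse index from any known alias to the canonical id.
reverse = {
    normalize(alias): canonical
    for canonical, per_tool in ALIASES.items()
    for alias in per_tool.values()
}

def resolve(name: str):
    return reverse.get(normalize(name))

print(resolve("ServiceA"))    # service-a
print(resolve("Service A"))   # service-a
print(resolve("checkout"))    # None -- unknown entity, needs a curated or inferred mapping
```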
Why Observability Platforms Don't Solve This
Datadog, Splunk, and New Relic are excellent at what they do. They're not designed for cross-domain correlation because:
Their data models are metric/trace/log-centric, not entity-centric. They know about hosts, services, and spans. They don't model code changes, team ownership, deployment pipelines, or architectural dependencies as first-class entities.
Their correlation is within-platform, not across-platform. Datadog correlates traces to logs beautifully, if both are in Datadog. Correlating Datadog traces to GitHub commits to PagerDuty incidents requires custom integration work that most teams never build.
They're optimized for real-time monitoring, not historical pattern recognition. When an incident happens, you need to know: "Has this happened before? What fixed it last time? Which code changes correlate with this failure pattern historically?" That requires long-term storage and analysis of incident/deployment/code relationships, not something observability platforms prioritize.
They treat code as external context, not core data. Code changes are the most common root cause of production incidents. But observability platforms treat code as annotation ("deployment happened at X") rather than analyzable data ("this deployment changed 3 files touching the authentication flow, similar to the change that caused the March incident").
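Treating code as analyzable data can start as simply as comparing a deployment's changed files against changes linked to past incidents. A naive sketch with made-up file sets and incident records:

```python
# Hypothetical sketch: flag deployments whose changed files overlap with
# changes tied to past incidents.
past_incidents = [
    {"id": "INC-0312", "summary": "auth outage in March",
     "changed_files": {"auth/session.py", "auth/tokens.py", "api/login.py"}},
    {"id": "INC-0544", "summary": "connection pool exhaustion",
     "changed_files": {"prefs/repository.py", "db/pool.py"}},
]

new_deploy_files = {"auth/session.py", "auth/middleware.py", "api/login.py"}

def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two sets of changed files."""
    return len(a & b) / len(a | b) if a | b else 0.0

for inc in past_incidents:
    score = overlap(new_deploy_files, inc["changed_files"])
    if score > 0.3:
        print(f"{inc['id']}: {score:.2f} file overlap -- review: {inc['summary']}")
```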
The Difference Between Correlation and Intelligence
Most AIOps tools stop at correlation. They cluster related alerts, reduce noise, and surface probable root causes based on temporal and topological proximity.
That's valuable. It's not sufficient.
Consider what an experienced SRE does that correlation can't:
Recognizes patterns across incidents: "This looks like the connection pool exhaustion we had in March"
Weighs evidence by relevance: "The deployment to user-preferences matters more than the marketing-site deployment because it touches shared infrastructure"
Considers counterfactuals: "If the user-preferences deployment hadn't happened, would we still see this latency?"
Incorporates organizational context: "Team Platform deployed this; they had 3 rollbacks last month. Let's check their changes first"
This is engineering intelligence, not just correlation. It requires:
Historical incident memory (what happened before?)
Code-level understanding (what did this change actually do?)
Architectural awareness (how do components interact?)
Team and process context (who did what, and is that unusual?)
What We're Building at Rebase
We're building an Engineering Intelligence Platform specifically because we saw this gap.
Rebase maintains a knowledge graph of your engineering system: services, dependencies, code ownership, deployment history, infrastructure topology, incident patterns, and team structure. When something breaks, we don't just correlate timestamps. We reason about causation.
For the payment service incident, that means surfacing the user-preferences deployment as the leading hypothesis: it changed code that hits the shared database, it completed before the latency spike, and the change matches a known N+1 risk pattern and this service's history of database-related incidents.
This isn't magic. It's encoding the relationships and reasoning patterns that experienced SREs use, then applying them automatically across every incident.
The Broader Vision: Beyond Incident Response
Cross-domain correlation for incidents is the starting point. The same capability enables:
Proactive risk detection: "This PR touches code paths similar to 3 past incidents. The deployment window is Friday afternoon. Infrastructure is at 82% capacity. Recommend Monday deployment with staged rollout."
Architectural intelligence: "Services touching the auth layer have 3x higher incident rates. 5 teams own code in this module with no clear ownership boundary."
Technical debt quantification: "This legacy module correlates with 40% of P0 incidents. Refactoring effort: 6 weeks. Projected incident reduction: 60%."
Team effectiveness insights: "Team A's deployments have 4x rollback rate. Root cause: inadequate infrastructure context during code review. Team A works on high-scale services but reviews don't check capacity."
These insights require the same foundation: a semantic model of your system, cross-domain data synthesis, and causal reasoning, not just correlation.
The tools we have are excellent at their individual jobs. Datadog for observability. PagerDuty for incident response. GitHub for code. Kubernetes for orchestration.
What's missing is the intelligence layer that connects them, that understands not just what's happening, but why, and what you should do about it.
That's what cross-domain correlation makes possible. That's what we're building.