Structured Logging for Real Root Cause Analysis
Why Most Logs Fail When You Need Them Most
Teams usually discover the quality of their logging during an incident.
That is the worst possible time to find out your logs are basically decorative text.
You have probably seen versions of this:
```
processing order failed
unable to save user
request error
something went wrong
```

Those messages are not useless because they are short. They are useless because they are missing structure.
When a real incident is unfolding, I want to answer questions quickly:
- Which request failed?
- For which tenant or user?
- In which service?
- During which operation?
- What downstream dependency was involved?
- How often is it happening?
- Is this the same failure mode as the last ten errors?
Plain text logs make those questions expensive.
The Job of a Log Event
A good log event should do at least one of these things well:
- describe a meaningful state transition
- preserve enough context to reconstruct a failure
- support aggregation and filtering across a system
That means log lines need predictable fields, not just readable sentences.
The Minimum Event Shape
This is the kind of shape I keep coming back to.
```json
{
  "timestamp": "2022-09-14T16:48:21.481Z",
  "level": "error",
  "service": "billing-api",
  "environment": "production",
  "message": "payment capture failed",
  "event": "payment.capture_failed",
  "requestId": "req_8f1c5d",
  "traceId": "trc_13aa91",
  "tenantId": "tenant_42",
  "userId": "usr_91",
  "orderId": "ord_8842",
  "provider": "stripe",
  "errorCode": "card_declined"
}
```

This gives me something I can work with.
I can filter by event, group by provider, correlate by traceId, and measure blast radius by tenantId or userId.
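As a sketch of why this shape pays off, here is how a handful of such events (made-up sample data, not real logs) can be filtered, grouped, and measured for blast radius in plain JavaScript:

```javascript
// Sample structured events, shaped like the example above (illustrative data).
const events = [
  { event: 'payment.capture_failed', provider: 'stripe', traceId: 'trc_13aa91', tenantId: 'tenant_42' },
  { event: 'payment.capture_failed', provider: 'stripe', traceId: 'trc_77bc02', tenantId: 'tenant_42' },
  { event: 'payment.capture_failed', provider: 'adyen',  traceId: 'trc_90de55', tenantId: 'tenant_7' },
  { event: 'payment.capture_succeeded', provider: 'stripe', traceId: 'trc_11aa00', tenantId: 'tenant_9' },
];

// Filter by the stable event name, not by message text.
const failures = events.filter((e) => e.event === 'payment.capture_failed');

// Group failures by provider.
const byProvider = {};
for (const e of failures) {
  byProvider[e.provider] = (byProvider[e.provider] ?? 0) + 1;
}

// Measure blast radius: distinct tenants seeing the failure.
const affectedTenants = new Set(failures.map((e) => e.tenantId));

console.log(byProvider);           // { stripe: 2, adyen: 1 }
console.log(affectedTenants.size); // 2
```

A real log backend does this with queries rather than array methods, but the operations are the same: filter, group, count distinct.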
What to Standardize First
If a team is moving from messy logs to structured logs, I usually standardize these fields first.
| Field | Why it matters |
|---|---|
| `timestamp` | ordering and time-window queries |
| `level` | triage and filtering |
| `service` | system ownership |
| `event` | stable machine-readable category |
| `message` | human-readable summary |
| `requestId` | per-request debugging |
| `traceId` | cross-service correlation |
| entity IDs | user, order, tenant, job, etc. |
If you get those right, the rest of the logging strategy gets much easier.
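One way to make the standard fields non-optional is to stamp them in a small wrapper so no call site can forget them. This is a minimal sketch with hypothetical names (`createLogger`, `emit`), not a specific library's API:

```javascript
// Minimal logger wrapper that stamps the standardized envelope onto every
// event. The `emit` sink is injectable so this sketch is testable.
function createLogger({ service, environment, emit = console.log }) {
  return {
    log(level, event, fields, message) {
      emit(JSON.stringify({
        timestamp: new Date().toISOString(),
        level,
        service,
        environment,
        event,
        message,
        ...fields, // requestId, traceId, entity IDs, etc.
      }));
    },
  };
}

// Usage: call sites supply only the per-event fields.
const lines = [];
const logger = createLogger({
  service: 'billing-api',
  environment: 'production',
  emit: (line) => lines.push(line),
});
logger.log('error', 'payment.capture_failed', { orderId: 'ord_8842' }, 'payment capture failed');

const parsed = JSON.parse(lines[0]);
console.log(parsed.service); // prints billing-api
```

Libraries like pino give you this for free via base bindings and child loggers; the point is that the envelope lives in one place.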
Message Strings Are Not the Schema
One of the most common anti-patterns is hiding all the useful information inside the message itself.
Bad:
```javascript
logger.error(`Payment failed for order ${order.id} for tenant ${tenant.id}`);
```

Better:

```javascript
logger.error({
  event: 'payment.capture_failed',
  orderId: order.id,
  tenantId: tenant.id,
  provider: 'stripe',
  errorCode: error.code,
}, 'payment capture failed');
```

The message helps the human. The fields help the system.
You want both.
Event Names Should Be Stable
I strongly prefer stable event names with a predictable pattern.
Examples:
```
user.login_succeeded
user.login_failed
payment.capture_started
payment.capture_failed
email.delivery_retried
```
This makes it much easier to:
- chart events over time
- detect spikes
- group similar failures
- build alerting without fuzzy text matching
The event name should be the durable machine identifier. The message can change more freely.
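One way to enforce that stability (a convention sketch, not a prescribed pattern) is to define event names once in a frozen module so call sites cannot drift into ad-hoc strings:

```javascript
// Event names defined in one place (illustrative constant names).
// Object.freeze prevents accidental mutation at runtime.
const Events = Object.freeze({
  USER_LOGIN_SUCCEEDED: 'user.login_succeeded',
  USER_LOGIN_FAILED: 'user.login_failed',
  PAYMENT_CAPTURE_STARTED: 'payment.capture_started',
  PAYMENT_CAPTURE_FAILED: 'payment.capture_failed',
  EMAIL_DELIVERY_RETRIED: 'email.delivery_retried',
});

// Because names follow a predictable `domain.action_result` pattern,
// dashboards and alerts can match on prefixes instead of fuzzy text.
const paymentEvents = Object.values(Events).filter((e) => e.startsWith('payment.'));
console.log(paymentEvents); // [ 'payment.capture_started', 'payment.capture_failed' ]
```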
Correlation IDs Are Not Optional in Distributed Systems
If a request crosses service boundaries, I want a trace or correlation identifier in every service log.
Without that, incident debugging becomes archaeology.
In Node services, I usually push this through request-scoped context.
```javascript
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] ?? crypto.randomUUID();
  req.context = { requestId };
  res.setHeader('x-request-id', requestId);
  next();
});
```

Then every log emitted during the request gets that identifier attached automatically.
Log the State Change, Not Just the Failure
A lot of teams only log errors. That sounds efficient until you need to understand the timeline around the error.
I want important state transitions too:
- job started
- job completed
- retry scheduled
- external request sent
- external response received
- state changed from pending to failed
This does not mean log everything. It means log the meaningful edges of the workflow.
That way, when a failure happens, I can reconstruct the path that led to it.
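Here is a sketch of what logging the edges of a workflow looks like for a retrying job. The event names and the `runJob` shape are illustrative; the point is that each transition becomes its own structured event:

```javascript
// Log the meaningful edges of a job: start, attempt failures, scheduled
// retries, and the terminal state. The timeline can then be reconstructed.
function runJob(jobId, work, maxAttempts = 3) {
  const timeline = [];
  const log = (event, fields = {}) => timeline.push({ event, jobId, ...fields });

  log('job.started');
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      work();
      log('job.completed', { attempt });
      return timeline;
    } catch (err) {
      log('job.attempt_failed', { attempt, errorMessage: err.message });
      if (attempt < maxAttempts) log('job.retry_scheduled', { nextAttempt: attempt + 1 });
    }
  }
  log('job.failed', { attempts: maxAttempts });
  return timeline;
}

// Fails once, then succeeds: the timeline shows the path, not just the error.
let calls = 0;
const timeline = runJob('job_17', () => {
  calls += 1;
  if (calls === 1) throw new Error('provider timeout');
});
console.log(timeline.map((e) => e.event));
// [ 'job.started', 'job.attempt_failed', 'job.retry_scheduled', 'job.completed' ]
```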
Log Context That Explains Blast Radius
One of the highest-value logging habits is including fields that let you measure impact quickly.
Depending on the system, that could be:
- `tenantId`
- `workspaceId`
- `customerId`
- `jobType`
- `region`
- `provider`
- `featureFlag`
If I can answer "who is affected" in two queries instead of twenty minutes, the logs are doing real work.
Redaction Has to Be Part of the Design
Structured logging makes logs more queryable. It can also make leaks easier if you are careless.
My default rules are:
- never log raw auth tokens
- never log passwords or secrets
- avoid full payload logging for PII-heavy domains
- prefer IDs and metadata over full objects
- centralize redaction in the logger, not in every call site
Example:
```javascript
const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'user.password',
      'payment.cardNumber',
    ],
    censor: '[REDACTED]',
  },
});
```

If a team has to remember redaction manually at every log call, sensitive data will leak eventually.
The Difference Between Debug Logs and Operational Logs
Not all structured logs are equal.
I think of them in two classes.
Debug Logs
Useful when inspecting a specific issue or local flow.
- detailed payload shapes
- branch decisions
- internal timing
Operational Logs
Useful for incidents, alerting, and production analysis.
- stable event names
- key identifiers
- provider/dependency details
- error classes and codes
Operational logs should exist even when debug logging is off. Otherwise the most useful signals disappear in production.
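A minimal sketch of that split, assuming a conventional numeric level ordering (names here are illustrative): the level threshold gates debug detail, while operational events sit at `info` and above, so they survive in production with debug off.

```javascript
// Conventional level ordering; debug detail is below the production threshold.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

function createLeveledLogger(minLevel, sink) {
  return (level, event, fields = {}) => {
    if (LEVELS[level] >= LEVELS[minLevel]) sink({ level, event, ...fields });
  };
}

const kept = [];
const log = createLeveledLogger('info', (entry) => kept.push(entry)); // production: debug off

log('debug', 'payment.request_payload', { bodyShape: 'full' });          // dropped
log('info', 'payment.capture_started', { provider: 'stripe' });          // kept
log('error', 'payment.capture_failed', { errorCode: 'card_declined' });  // kept

console.log(kept.map((e) => e.event));
// [ 'payment.capture_started', 'payment.capture_failed' ]
```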
Queryability Is the Whole Point
If you cannot slice logs by fields, you are doing very expensive string storage.
I want to ask questions like:
- show all `payment.capture_failed` events in the last hour
- group failures by `provider`
- show the top affected `tenantId` values
- show every event for `traceId=trc_13aa91`
That is why structure matters. It turns logs from prose into data.
My Practical Logging Rules
These rules have held up well for me.
- Every log should have a stable `event` field.
- Important workflows should include request or trace correlation.
- Errors should log the dependency and error code when possible.
- Important business entities should be identified by ID.
- Full objects should be rare, not normal.
- Sensitive fields must be redacted centrally.
- Success transitions are worth logging for critical workflows.
What Good Looks Like During an Incident
When logging is healthy, incident response gets simpler fast.
You can answer:
- where the problem started
- which dependency is involved
- who is affected
- whether the error is increasing
- whether retries are helping or hurting
That is what root cause analysis needs. Not prettier strings. Better evidence.
The Main Takeaway
Most logging is optimized for the person writing the line, not the person debugging the outage later.
Structured logging flips that.
It forces teams to capture logs as events with usable fields, stable categories, and the identifiers needed to connect failures across a system.
Once you do that, logs stop being noise and start becoming one of the fastest tools you have for finding the truth.