Observability is how you understand what's happening in production. When something breaks at 3 AM, your observability stack determines whether you fix it in minutes or spend hours guessing.
Interviewers test observability knowledge because it separates engineers who've operated real systems from those who've only built them. You can write perfect code, but if you can't debug it in production, you're not ready for senior roles.
This guide covers the three pillars of observability, the tools you'll be asked about, and the interview questions that test whether you actually understand monitoring or just know the buzzwords.
Table of Contents
- Monitoring vs Observability Questions
- Three Pillars of Observability Questions
- Metrics and Prometheus Questions
- PromQL Questions
- RED and USE Method Questions
- Grafana Questions
- Logging Questions
- ELK Stack Questions
- Distributed Tracing Questions
- OpenTelemetry Questions
- Alerting Questions
- SLI, SLO, and SLA Questions
- Incident Response Questions
- Observability Scenario Questions
Monitoring vs Observability Questions
These terms are often used interchangeably, but they represent different approaches to understanding system health.
What is monitoring and what questions does it answer?
Monitoring is the practice of collecting and analyzing predefined metrics to answer specific questions you've already thought of. It's reactive by design—you set up dashboards and alerts for known failure modes and expected behaviors.
Typical monitoring questions include: Is CPU above 80%? Is the service responding? Are there more than 10 errors per minute? Monitoring tells you that something is wrong, but often lacks the context to explain why.
What is observability and how does it differ from monitoring?
Observability is the ability to understand a system's internal state by examining its external outputs. Unlike monitoring which answers predefined questions, observability lets you ask questions you haven't thought of yet without deploying new code.
With an observable system, you can investigate: Why is this specific user's request slow? What changed between yesterday and today? Which service is causing the cascade failure? Observability tells you why something is wrong, not just that something is wrong.
The key difference comes down to exploration versus alerting. A well-monitored system has good dashboards for known metrics. An observable system has sufficient instrumentation to debug novel problems through exploration and correlation.
Monitoring: "Is the system healthy?" (yes/no)
Observability: "Why did user X's request fail at 2:34 PM?" (investigation)
What makes a system observable?
A system is observable when you can determine its internal state from its external outputs. This requires intentional instrumentation across three dimensions: metrics for quantitative measurement, logs for qualitative detail, and traces for request flow visibility.
Observable systems share common characteristics: high-cardinality data that lets you drill down to individual requests, correlation between different data types, and sufficient context in each signal to support investigation without additional code changes.
Three Pillars of Observability Questions
Observability is built on three complementary data types, each answering different questions about your system.
What are the three pillars of observability?
The three pillars are metrics, logs, and traces. Each serves a distinct purpose, and together they provide complete visibility into system behavior. No single pillar is sufficient—you need all three working together.
| Pillar | What It Is | What It Answers |
|---|---|---|
| Metrics | Numerical measurements over time | What's happening? How much? |
| Logs | Discrete event records | Why did it happen? What exactly? |
| Traces | Request flow across services | Where did it happen? Which path? |
How do the three pillars work together in practice?
In a typical debugging scenario, you move between pillars as you narrow down the problem. Each pillar provides context that guides you to the next level of detail.
A typical debugging flow demonstrates this interplay. First, a metrics alert fires showing that the error rate spiked to 5%. Then you investigate the logs, which reveal "database connection timeout" errors. Finally, traces pinpoint that requests to /api/orders are slow specifically at the inventory service database call.
Metrics are cheap to store and query but lack detail. Logs are detailed but hard to aggregate across services. Traces show the complete request flow but typically sample only a fraction of requests. The combination covers each pillar's weaknesses.
Metrics and Prometheus Questions
Metrics are numerical measurements collected over time. They form the foundation of monitoring because they're cheap to store and fast to query at scale.
What are the different Prometheus metric types?
Prometheus supports four metric types, each designed for different measurement scenarios. Understanding when to use each type is essential for interviews and for building effective monitoring.
| Type | Description | Example |
|---|---|---|
| Counter | Only increases (resets on restart) | Total requests, errors, bytes sent |
| Gauge | Can go up or down | Temperature, queue size, active connections |
| Histogram | Samples in configurable buckets | Request latency distribution |
| Summary | Calculates quantiles client-side | Request latency percentiles |
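As a sketch of how these types look in application code, here is the Python prometheus_client library declaring and updating each one (the metric and label names are illustrative, not prescribed by Prometheus):

```python
# Sketch: declaring and updating the four Prometheus metric types with the
# Python prometheus_client library. Metric and label names are illustrative.
from prometheus_client import Counter, Gauge, Histogram, Summary

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
QUEUE_SIZE = Gauge("job_queue_size", "Jobs currently waiting in the queue")
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5])
LATENCY_SUMMARY = Summary("http_request_duration_summary_seconds", "Request latency")

def record_request(duration_seconds: float) -> None:
    REQUESTS.labels(method="GET", status="200").inc()  # counter: only increases
    QUEUE_SIZE.set(42)                                 # gauge: can go up or down
    LATENCY.observe(duration_seconds)                  # histogram: fills configured buckets
    LATENCY_SUMMARY.observe(duration_seconds)          # summary: count/sum tracked client-side
```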
When would you use a histogram vs a summary?
This is a common interview question that tests whether you understand the tradeoffs. The key difference is where percentile calculation happens and whether results can be aggregated.
Histograms store observations in configurable buckets and calculate percentiles server-side. Because the raw bucket data is available, you can aggregate histograms across multiple instances—essential for distributed systems. Summaries calculate percentiles client-side and stream pre-computed quantiles, which cannot be meaningfully aggregated.
Use histograms for most cases because they're aggregatable. Use summaries only when you need precise percentiles and don't need to aggregate across instances.
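Because histogram buckets are plain counters, you can sum them across instances before computing the percentile. A fleet-wide p95 (assuming a histogram named http_request_duration_seconds, as in the examples later in this guide) looks like:

```promql
# p95 across all instances: sum the bucket rates first, keeping only the "le" label
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```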
How does Prometheus collect metrics?
Prometheus uses a pull-based model where it scrapes HTTP endpoints exposed by applications. Each application exposes a /metrics endpoint that returns current metric values in Prometheus format. The Prometheus server periodically scrapes these endpoints and stores the data in its time-series database.
This pull model has several advantages: applications don't need to know about Prometheus, failed scrapes are detectable, and you can easily run multiple Prometheus servers for high availability. For short-lived jobs that can't be scraped, Prometheus provides a Pushgateway component.
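For context, here is a minimal sketch of the application side using the Python prometheus_client library; the port and metric name are arbitrary examples:

```python
# Sketch: exposing a /metrics endpoint that Prometheus can scrape.
# Port and metric name are arbitrary examples.
import random
import time

from prometheus_client import Counter, start_http_server

HEARTBEATS = Counter("worker_heartbeats_total", "Heartbeats emitted by this worker")

if __name__ == "__main__":
    start_http_server(8000)   # serves metrics at http://localhost:8000/metrics
    while True:
        HEARTBEATS.inc()      # Prometheus pulls the current value on each scrape
        time.sleep(random.uniform(0.5, 1.5))
```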
PromQL Questions
PromQL (Prometheus Query Language) is how you query metrics in Prometheus, and it is also supported by many Prometheus-compatible systems. Knowing common query patterns is essential for interviews.
What are the essential PromQL queries you should know?
PromQL has specific patterns for each metric type. The most important distinction is between instant vectors (single value per series) and range vectors (values over time).
For counters, always use rate() or increase() to get meaningful values. Counters only go up, so the raw value isn't useful—you need the rate of change.
```promql
# Rate of requests per second (for counters)
rate(http_requests_total[5m])

# Increase in requests over 1 hour
increase(http_requests_total[1h])
```

For gauges, the raw value is meaningful since gauges can go up or down.

```promql
# Current memory available
node_memory_MemAvailable_bytes

# Average over 5 minutes
avg_over_time(node_memory_MemAvailable_bytes[5m])
```

For histograms, use histogram_quantile() to calculate percentiles from bucket data.

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

How do you filter and aggregate metrics in PromQL?
PromQL uses label matching for filtering and aggregation operators for grouping results. Labels are key-value pairs attached to each metric that enable slicing and dicing data.
Filtering uses curly braces with label matchers. The =~ operator enables regex matching.
```promql
# Filter by exact label value
http_requests_total{status="500"}

# Filter by regex (5xx errors)
http_requests_total{status=~"5.."}

# Exclude values
http_requests_total{method!="OPTIONS"}
```

Aggregation operators combine multiple series. The by clause specifies which labels to preserve.
```promql
# Sum all requests
sum(rate(http_requests_total[5m]))

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Average rate by instance (counters need rate() before averaging)
avg(rate(node_cpu_seconds_total[5m])) by (instance)
```

RED and USE Method Questions
These frameworks provide systematic approaches to monitoring different types of systems.
What is the RED method for monitoring services?
The RED method defines three golden signals for monitoring request-driven services like APIs and microservices. It focuses on user-facing behavior rather than infrastructure metrics.
RED stands for Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution). These three metrics capture the essential user experience: how much traffic, how many failures, and how fast.
```promql
# Rate - requests per second
sum(rate(http_requests_total[5m]))

# Errors - error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration - 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

The RED method is service-centric. For every service, create a dashboard showing these three metrics. When any degrades, users are likely affected.
What is the USE method for monitoring infrastructure?
The USE method defines three metrics for monitoring infrastructure resources like CPU, memory, disk, and network. It focuses on resource health and capacity planning.
USE stands for Utilization (percentage of resource capacity used), Saturation (queue depth when resource is fully utilized), and Errors (error events for the resource). These three metrics reveal whether resources are the bottleneck.
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU usage % | Run queue length | - |
| Memory | Memory used % | Swap usage | OOM events |
| Disk | Disk usage % | I/O queue depth | Read/write errors |
| Network | Bandwidth usage | Dropped packets | Interface errors |
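As a sketch, the utilization column can be expressed with standard node_exporter metrics (assuming node_exporter's metric names; saturation signals like run queue length typically require additional collectors):

```promql
# CPU utilization: percent of time not spent idle, per instance
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory utilization: percent of memory in use
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk busy fraction: seconds spent doing I/O per second (close to 1 means the device is busy)
rate(node_disk_io_time_seconds_total[5m])
```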
When would you use RED vs USE?
Use RED for monitoring services (APIs, microservices, databases-as-services). RED measures user-facing behavior—if RED metrics are healthy, users are happy.
Use USE for monitoring infrastructure resources (servers, containers, VMs). USE measures resource health—if USE metrics are healthy, you have capacity.
In practice, you need both. A service might have good RED metrics while running on saturated infrastructure. Eventually the infrastructure problems will manifest as RED degradation, but USE gives you early warning.
Grafana Questions
Grafana is the standard tool for visualizing metrics from Prometheus and other data sources.
How do you design effective Grafana dashboards?
Dashboard design significantly impacts how quickly teams can diagnose issues. Good dashboards tell a story and guide investigation from high-level health to specific problems.
Start with the four golden signals at the top: latency, traffic, errors, and saturation. These give immediate visibility into service health. Group related metrics together logically—don't scatter CPU metrics across multiple rows.
Use consistent colors and naming conventions. Red should always mean bad. Variables make dashboards reusable across environments (dev, staging, prod) and services. Annotations mark deployments and incidents directly on graphs for correlation.
What are Grafana variables and why are they useful?
Grafana variables create dropdown selectors that parameterize dashboard queries. Instead of creating separate dashboards for each service or environment, one dashboard serves all by selecting the appropriate variable values.
Common variable types include query variables (populated from Prometheus labels), custom variables (static lists), and interval variables (for time-based aggregations). Variables are referenced in queries using $variable_name syntax.
```promql
# Query using variables
rate(http_requests_total{service="$service", environment="$environment"}[5m])
```

How does Grafana alerting work?
Grafana can evaluate queries and trigger alerts when conditions are met. Alerts are configured per panel with threshold conditions, evaluation intervals, and notification channels.
A basic alert checks if a query result crosses a threshold for a specified duration. For example, alert if error rate exceeds 1% for 5 minutes. The duration prevents alerting on brief spikes.
Grafana supports multiple notification channels: email, Slack, PagerDuty, OpsGenie, webhooks. Alert rules can include templated messages with metric values and dashboard links for context.
Logging Questions
Logs record discrete events and provide the detail that metrics lack, but they're expensive to store and query at scale.
What is the difference between structured and unstructured logs?
Structured logs use a consistent format (typically JSON) with defined fields, making them queryable and aggregatable. Unstructured logs are free-form text that requires parsing to extract information.
Unstructured logs are human-readable but machine-unfriendly:
```text
2026-01-07 10:15:32 ERROR Failed to process order #12345 for user john@example.com
```
Structured logs are machine-parseable and queryable:
```json
{
  "timestamp": "2026-01-07T10:15:32Z",
  "level": "error",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error": "payment_declined"
}
```

With structured logs, you can query: "Show me all errors for order_id 12345" or "Count errors by error type." These queries are impossible or require complex regex with unstructured logs.
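A minimal sketch of emitting structured logs with only the Python standard library (field names mirror the example above; real services often use a library such as structlog or python-json-logger instead):

```python
# Sketch: a JSON formatter for Python's standard logging module.
# Field names mirror the example above.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # structured context fields
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Failed to process order",
             extra={"fields": {"order_id": "12345", "error": "payment_declined"}})
```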
What are log levels and when should you use each?
Log levels categorize messages by severity and help filter noise. Using levels consistently across services enables effective log analysis and alerting.
| Level | When to Use |
|---|---|
| DEBUG | Detailed diagnostic info, disabled in production |
| INFO | Normal operations worth recording (startup, requests) |
| WARN | Something unexpected but handled (retry succeeded) |
| ERROR | Operation failed, needs attention |
| FATAL | Application cannot continue |
The distinction between WARN and ERROR is a common interview question. WARN means something unusual happened, but the system handled it—a retry succeeded, a fallback was used. ERROR means something actually failed and likely needs investigation or action.
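A short hypothetical sketch of that boundary, using a retry around a flaky call (fetch_inventory is an illustrative stand-in):

```python
# Sketch: WARN when a retry recovers, ERROR when the operation finally fails.
import logging

logger = logging.getLogger("inventory")

def fetch_with_retries(fetch_inventory, attempts: int = 3):
    for attempt in range(1, attempts + 1):
        try:
            return fetch_inventory()
        except ConnectionError as exc:
            if attempt < attempts:
                # Unexpected but handled: WARN
                logger.warning("fetch_inventory failed (attempt %d), retrying: %s", attempt, exc)
            else:
                # Actually failed, needs attention: ERROR
                logger.error("fetch_inventory failed after %d attempts: %s", attempts, exc)
                raise
```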
What are correlation IDs and why are they important?
A correlation ID (or trace ID) is a unique identifier that links related logs across services. When a request flows through multiple services, each service includes the same correlation ID in its logs, enabling end-to-end request tracking.
Pass the correlation ID in HTTP headers (e.g., X-Correlation-ID or W3C Trace Context headers) and include it in every log entry. This is the bridge between logging and tracing—you can search logs for a specific trace ID to see all related events.
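As a framework-agnostic sketch using Python contextvars (the header name and helper functions here are hypothetical), the pattern looks like this; the JSON log lines below show the kind of output it produces:

```python
# Sketch: propagating a correlation ID without assuming a web framework.
# Header name and helper functions are hypothetical.
import contextvars
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default=None)

def handle_incoming_request(headers: dict) -> None:
    # Reuse the caller's ID if present, otherwise start a new one.
    correlation_id.set(headers.get("X-Correlation-ID") or str(uuid.uuid4()))

def outgoing_headers() -> dict:
    # Attach the same ID to every downstream call.
    return {"X-Correlation-ID": correlation_id.get()}

def log(service: str, message: str) -> None:
    print({"correlation_id": correlation_id.get(), "service": service, "message": message})
```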
{"correlation_id": "abc-123", "service": "api", "message": "Received order request"}
{"correlation_id": "abc-123", "service": "inventory", "message": "Checking stock"}
{"correlation_id": "abc-123", "service": "payment", "message": "Processing payment"}ELK Stack Questions
The ELK stack (Elasticsearch, Logstash, Kibana) is a common logging solution, though many teams now use alternatives or cloud services.
How does the ELK stack work?
The ELK stack consists of three components that work together to collect, store, and visualize logs. Each component has a specific role in the pipeline.
```mermaid
flowchart LR
    App["Application"] --> LS["Logstash<br/>(collect)"]
    LS --> ES["Elasticsearch<br/>(store/index)"]
    ES --> K["Kibana<br/>(visualize)"]
```

Elasticsearch is a distributed search and analytics engine. It stores logs in indices, enables full-text search and aggregations, and scales horizontally with sharding.
Logstash is a data processing pipeline. It collects logs from multiple sources, parses and transforms them (extracting fields, enriching data), and sends them to Elasticsearch.
Kibana is the visualization layer. It provides dashboards, log exploration interfaces, and alerting capabilities on top of Elasticsearch data.
What are common alternatives to the ELK stack?
Many teams replace Logstash with lighter alternatives like Filebeat or Fluent Bit. These agents are more resource-efficient for simple log collection and forwarding.
Cloud logging services (Datadog, Splunk, AWS CloudWatch, Google Cloud Logging) eliminate operational burden but have higher costs at scale. Loki (from Grafana Labs) provides a simpler architecture that indexes only metadata, not log content.
The choice depends on scale, budget, and operational capacity. Self-hosted ELK requires significant expertise to operate reliably at scale.
What log aggregation patterns are commonly used?
Centralized logging means all services send logs to a single system. This provides one place to search and enables correlation across services. The downside is network dependency and potential bottleneck.
The sidecar pattern runs a log collector alongside each service, commonly used in Kubernetes with Fluentd or Fluent Bit. The sidecar handles log rotation, buffering, and retries, isolating the application from logging concerns.
The DaemonSet pattern in Kubernetes runs one log collector per node, which collects logs from all pods on that node. This is more resource-efficient than sidecars but requires pods to write logs to node-level locations.
Distributed Tracing Questions
Tracing shows how requests flow through distributed systems, essential for debugging in microservices architectures.
Why is distributed tracing important in microservices?
In a monolith, a stack trace shows where things went wrong. In microservices, a single request might touch dozens of services, and stack traces only show one service at a time. Tracing solves this by tracking requests across service boundaries.
Consider a request that hits the API gateway, calls auth service, user service, order service, inventory service, and payment service. If the request is slow, which service is the bottleneck? Logs from each service don't show the full picture because they lack timing relationships. Tracing shows exactly where time is spent.
What are traces, spans, and context propagation?
A trace represents the complete journey of a request through a distributed system. It contains multiple spans representing individual operations.
A span represents a single operation within a trace—typically one service call. Each span includes start time, duration, service name, operation name, tags (metadata), and logs (events within the span).
Context propagation is passing trace and span IDs between services so spans can be correlated into a complete trace. Without propagation, you get isolated spans instead of connected traces.
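With OpenTelemetry (covered later in this guide), propagation is typically done by injecting and extracting W3C trace context headers. A minimal sketch (the http_get callable is an illustrative stand-in for your HTTP client); the diagram below shows the connected trace that results:

```python
# Sketch: injecting trace context into outgoing headers on the caller side
# and extracting it on the callee side, using the OpenTelemetry propagation API.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-service")

def call_inventory_service(http_get):
    headers = {}
    inject(headers)  # adds the W3C "traceparent" header for the current span
    return http_get("http://inventory/stock", headers=headers)

def handle_request(request_headers: dict):
    ctx = extract(request_headers)  # continue the caller's trace
    with tracer.start_as_current_span("check-stock", context=ctx):
        ...
```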
```mermaid
flowchart TB
    T["Trace: order-request-abc123"]
    T --> AG["api-gateway<br/>(10ms)"]
    AG --> Auth["auth-service<br/>(5ms)"]
    AG --> Order["order-service<br/>(200ms)"]
    Order --> Inv["inventory-check<br/>(50ms)"]
    Order --> Pay["payment-process<br/>(140ms)<br/>← bottleneck"]
```

What are the different tracing sampling strategies?
Tracing every request is expensive at scale. Sampling reduces volume while maintaining visibility into system behavior. The strategy determines which requests get traced.
| Strategy | Description | Use Case |
|---|---|---|
| Head-based | Decide at request start | Simple, consistent |
| Tail-based | Decide after request completes | Keep errors/slow requests |
| Rate limiting | Fixed samples per second | Predictable cost |
| Probabilistic | Random percentage | Simple to configure |
Tail-based sampling is powerful for debugging because you can sample 100% of errors and slow requests while sampling only 1% of successful fast requests. You keep the interesting data while controlling costs.
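As a sketch with the OpenTelemetry Python SDK (covered in the next section; the 1% ratio is just an example), head-based probabilistic sampling that also respects the parent span's decision looks like:

```python
# Sketch: head-based probabilistic sampling of 1% of new traces,
# honoring the sampling decision of the parent span when one exists.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
```

Tail-based sampling, by contrast, is typically implemented in the OpenTelemetry Collector rather than in the application SDK, since the decision happens after the whole request has completed.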
OpenTelemetry Questions
OpenTelemetry (OTel) is the industry standard for observability instrumentation.
What is OpenTelemetry and why should you use it?
OpenTelemetry is a vendor-neutral observability framework that provides APIs, SDKs, and tools for generating and collecting telemetry data. It's the result of merging two earlier projects: OpenTracing and OpenCensus.
The key benefit is vendor independence. Instrument your code once with OpenTelemetry, then send data to any backend: Jaeger, Zipkin, Datadog, New Relic, or others. If you switch observability vendors, you don't need to re-instrument.
OpenTelemetry provides auto-instrumentation for common frameworks and libraries. For many languages, you can get tracing with minimal code changes by adding the OTel agent.
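A minimal manual-instrumentation sketch with the OpenTelemetry Python SDK (a console exporter keeps the example self-contained; real setups usually export via OTLP to a Collector):

```python
# Sketch: manual instrumentation with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")     # tags/attributes on the span
    with tracer.start_as_current_span("charge-payment"):
        pass                                     # child span, nested automatically
```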
How does the OpenTelemetry Collector work?
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples instrumentation from backends and provides a central point for processing.
```mermaid
flowchart LR
    App["App<br/>(OTel SDK)"] --> Collector["OTel<br/>Collector"]
    Collector --> Backend["Jaeger/Zipkin/<br/>Datadog"]
```

The Collector has three components: receivers (accept data in various formats), processors (transform, filter, batch), and exporters (send to backends). This architecture lets you change backends by reconfiguring the Collector rather than changing application code.
What is Jaeger and how is it used?
Jaeger is a popular open-source distributed tracing backend. It stores and indexes traces, provides a UI for visualization, and supports multiple storage backends like Elasticsearch and Cassandra.
Key Jaeger features include service dependency graphs (visualizing service relationships), trace comparison (comparing slow vs fast requests), and root cause analysis (identifying bottlenecks). Jaeger integrates well with Kubernetes and service meshes.
In production, Jaeger typically receives traces from the OpenTelemetry Collector, stores them in Elasticsearch or Cassandra for querying, and provides a web UI for trace exploration.
Alerting Questions
Observability data is useless if no one sees it when things break. Alerting bridges the gap between data and action.
What are the key principles of good alert design?
Well-designed alerts reduce noise while ensuring real issues get attention. The fundamental principle is that every alert should be actionable—if you can't do anything about it, don't wake someone up.
Alert on symptoms, not causes. Bad: "CPU above 80%". Good: "Error rate above 1%". CPU might be high because you're doing useful work—that's not inherently bad. Error rate means users are affected—that always matters.
Use appropriate severity levels. Critical alerts page immediately for user-facing impact. Warning alerts notify but don't page—investigate during business hours. Info alerts are for awareness only.
Include runbook links in alert descriptions. When someone gets paged at 3 AM, they shouldn't have to figure out what to do. The alert should link to step-by-step remediation instructions.
How do you prevent alert fatigue?
Alert fatigue occurs when teams receive so many alerts that they start ignoring them. It's dangerous because real issues get missed. Prevention requires discipline and regular review.
Review alerts regularly and delete ones nobody acts on. If an alert fires weekly and nobody investigates, it's training your team to ignore alerts. Set proper thresholds with hysteresis—alert at 90%, clear at 80%—to prevent flapping alerts.
Group related alerts so one incident doesn't generate 50 separate pages. Distinguish pages (wake someone up) from notifications (async awareness). Track alert metrics: acknowledge rate, false positive rate, time to resolution.
What is hysteresis in alerting and why is it important?
Hysteresis prevents alerts from flapping—repeatedly firing and clearing as a metric oscillates around a threshold. Without hysteresis, a CPU at 79-81% might trigger dozens of alerts per hour.
With hysteresis, you set different thresholds for firing and clearing. Alert when CPU exceeds 80%, clear when it drops below 70%. The metric must genuinely recover, not just dip briefly below the threshold.
This approach reduces noise while ensuring real issues still trigger alerts. The gap between thresholds should be large enough to indicate genuine state change.
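Hysteresis is usually configured in the alerting tool itself, but the underlying logic is simple. A hypothetical sketch (thresholds match the example above):

```python
# Sketch: hysteresis with separate fire and clear thresholds. The class is hypothetical.
class HysteresisAlert:
    def __init__(self, fire_above: float = 80.0, clear_below: float = 70.0):
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value > self.fire_above:
            self.firing = True        # fire only when clearly above the threshold
        elif self.firing and value < self.clear_below:
            self.firing = False       # clear only after genuine recovery
        return self.firing

alert = HysteresisAlert()
for cpu in [79, 81, 79, 81, 75, 69]:
    print(cpu, alert.update(cpu))     # stays firing through the 79-81 oscillation
```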
SLI, SLO, and SLA Questions
These terms are frequently confused in interviews but represent distinct concepts in reliability engineering.
What is the difference between SLI, SLO, and SLA?
These three concepts form a hierarchy for defining and measuring service reliability. Understanding their relationships is essential for production operations.
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator - a metric measuring service behavior | 99.2% of requests succeed |
| SLO | Service Level Objective - internal target for that metric | 99.9% success rate target |
| SLA | Service Level Agreement - contractual commitment | 99.5% uptime or customer credits |
SLIs measure actual performance—they're the metrics themselves. SLOs set internal targets for those metrics—they define "good enough." SLAs are promises to customers with consequences for missing them—they're usually looser than SLOs to provide buffer.
What is an error budget and how is it used?
An error budget is the inverse of your SLO expressed as allowed failures. If your SLO is 99.9% availability, your error budget is 0.1%—about 43 minutes of downtime per month.
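The arithmetic is straightforward; a quick sketch converting availability targets into monthly downtime budgets (assuming a 30-day month):

```python
# Sketch: converting an availability SLO into a monthly error budget (30-day month).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for slo in (0.99, 0.999, 0.9999):
    budget_minutes = (1 - slo) * MINUTES_PER_MONTH
    print(f"{slo:.2%} SLO -> {budget_minutes:.1f} minutes of downtime per month")
```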
Error budgets create alignment between development velocity and reliability. When you have budget remaining, you can take risks: deploy faster, experiment more. When budget is exhausted, focus shifts to reliability: slower rollouts, more testing, fixing tech debt.
This approach replaces subjective debates ("Is this reliable enough?") with objective decisions based on measured reality. Teams that consistently exhaust their error budget know they need to invest in reliability.
How do you choose good SLIs?
Good SLIs measure what users actually experience, not internal system metrics. Focus on requests, not resources.
For request-driven services, the standard SLIs are availability (percentage of successful requests), latency (request duration at various percentiles), and throughput (requests per second if relevant to user experience).
Avoid vanity metrics that look good but don't reflect user experience. 99.9% of requests succeeding is meaningful. 99.9% CPU availability is not—users don't experience CPU.
Incident Response Questions
Observability enables incident response, and interviewers often ask about on-call practices and postmortems.
What are on-call best practices?
Effective on-call requires clear processes, fair rotations, and good tooling. The goal is quick resolution with sustainable workload.
Rotation schedules should be fair and predictable. Weekly rotations are common, with clear handoffs documenting ongoing issues. Escalation paths define what happens when the primary doesn't respond—secondary on-call, then management.
Runbooks provide step-by-step remediation for common alerts. Good runbooks reduce mean time to resolution and enable less experienced engineers to handle incidents. They should be maintained as living documents updated after each incident.
What is a blameless postmortem?
A blameless postmortem analyzes an incident to understand what happened and prevent recurrence, without assigning individual blame. The focus is on systems and processes, not people.
The standard format covers: what happened (timeline), impact (users affected, duration), root cause (why it happened), contributing factors (what made it worse), and action items (what we'll change). Action items should be specific, assigned, and tracked.
Blameless culture is essential because blame discourages honesty. If people fear punishment, they'll hide mistakes rather than learning from them. The goal is to make the system more resilient, not to punish individuals.
How do you write effective incident timelines?
Incident timelines document what happened and when, providing the foundation for postmortem analysis. Good timelines are detailed, timestamped, and objective.
Include: when the incident was first detected, what alerts fired, when humans were engaged, what actions were taken, when mitigation began working, and when full recovery was confirmed. Note who did what, but avoid blame language.
Use UTC timestamps for clarity across time zones. Link to dashboards, logs, and chat transcripts that provide additional context. The timeline should enable someone who wasn't there to understand exactly what happened.
Observability Scenario Questions
Interviewers often present scenarios to test practical debugging skills.
Your API's p99 latency suddenly increased from 200ms to 2s. How would you investigate?
Start with metrics to understand the scope. Is it all endpoints or specific ones? All users or specific regions? Check recent deployments or config changes using dashboard annotations.
Next, examine traces to identify where time is being spent. If one downstream service shows high latency, that's your bottleneck. Compare slow traces to fast ones from before the incident.
Check resource metrics using the USE method—CPU saturation, memory pressure, disk I/O. Look for database slow queries in logs. Verify external dependency latency if you call third-party APIs.
The key is systematic investigation: narrow scope with metrics, identify bottleneck with traces, understand cause with logs and resource metrics.
How would you design alerting for a new microservice?
Start with RED metrics. Alert on error rate exceeding your SLO (e.g., error rate above 1% for 5 minutes). Alert on latency SLO breach (e.g., p99 above 500ms for 5 minutes).
Add resource alerts only if they correlate with user impact. CPU alerts are often noisy—high CPU during legitimate load isn't a problem. Memory and disk alerts are more actionable.
Include runbook links in every alert. Set up escalation through PagerDuty or OpsGenie. After collecting baseline data (a few weeks), review and tune thresholds based on actual behavior.
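The two SLO alerts translate into PromQL conditions like these (metric names follow the examples used earlier in this guide):

```promql
# Error-rate condition: more than 1% of requests failing over 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.01

# Latency condition: p99 above 500ms over 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
```

In Prometheus, each expression would typically live in an alerting rule with a for: 5m clause so brief spikes don't page anyone.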
You're getting 1000 alerts per day and the team is ignoring them. How do you fix this?
This is an alert fatigue crisis. Start by auditing every alert that fired in the past week. Categorize them: actionable (someone investigated and fixed something), noisy (fired but no action taken), or duplicate (same incident, multiple alerts).
Delete alerts nobody acts on—they're training your team to ignore pages. Consolidate related alerts so one incident doesn't generate dozens of notifications. Separate pages (immediate action required) from notifications (async awareness).
Add hysteresis to prevent flapping. Track metrics going forward: acknowledge rate, time to resolution, false positive rate. Establish a regular review process—monthly alert pruning—to prevent accumulation.
Quick Reference
Three Pillars:
- Metrics: What's happening (Prometheus, Grafana)
- Logs: Why it happened (ELK, structured logging)
- Traces: Where in the system (Jaeger, OpenTelemetry)
RED Method (for services):
- Rate, Errors, Duration
USE Method (for resources):
- Utilization, Saturation, Errors
Metric Types:
- Counter (increases), Gauge (up/down), Histogram (distribution), Summary (percentiles)
SLI/SLO/SLA:
- SLI measures, SLO targets, SLA promises
Alert Best Practices:
- Symptoms not causes
- Actionable with runbooks
- Regular review and pruning
Related Articles
- Complete DevOps Engineer Interview Guide - Full DevOps interview preparation
- Docker Interview Guide - Container fundamentals
- Kubernetes Interview Guide - Container orchestration
- Linux Commands Interview Guide - Essential Linux skills
- CI/CD & GitHub Actions Interview Guide - Pipeline and deployment automation
