Strong DevOps candidates are evaluated on four dimensions, regardless of the specific question:

1.  Operational judgment - Trade-offs, failure modes, and rollout strategies over tool-name memorization.

2.  Change safety - Can you ship without breaking things? Do you have rollback paths for every change?

3.  Systems thinking - Do you understand how your tools fail, not just how they work?

4.  Production realism - Real numbers, real incidents, real constraints, not just idealized architectures.

Reading guide

Where to Focus Based on Your Level

Don’t read this like a flat list. Use it like a hiring map and spend your time where your target level is actually judged.

Applying for mid-level

Focus on Q1, Q2, Q4, Q7, Q10, Q13, Q16, Q21, Q26.

Applying for senior

Focus on Q3, Q5, Q6, Q8, Q9, Q11, Q12, Q14, Q15, Q17, Q18, Q19, Q20, Q22, Q23, Q24, Q25, Q27, Q28, Q29.

Applying for Staff / SRE

Focus on Q17, Q18, Q22, Q25, Q28, Q29 plus the system design crossover.

Q1. Docker vs VM - image vs container, and what belongs in a Dockerfile?

Difficulty: Mid-level

What it tests: Whether you understand isolation, packaging, and runtime trade-offs and not just "Docker is lightweight."

Approach: Containers share the host kernel; VMs virtualize hardware with a full guest OS - stronger isolation, higher overhead. An image is a layered read-only snapshot; a container is a running instance of one. Dockerfiles define how to build an image deterministically.

Key components: Image layers, container runtime, registry, Dockerfile build context, multi-stage builds.

Bottlenecks: Cold start time, image bloat, noisy neighbor issues on shared kernels, kernel-level attack surface.

Command: docker build -t app:1.2.3 .; pin versions, use multi-stage builds to keep images lean, and never run as root.

Numbers: Target image size < 300MB; container start < 2s for common services.

Hiring signal: You mention image scanning, multi-arch builds, and why you'd never run as root in a container without being asked.

Q2. What is Infrastructure as Code (IaC), and how do you ship infra changes safely?

Difficulty: Mid-level

What it tests: Change control and reproducibility discipline, not which tool you prefer.

Approach: Define desired infra in code, review it like application code, and apply via automated pipelines - plan → approve → apply. Treat infra changes with the same rigor as a production deploy.

Key components: Modules/templates, remote state, policy checks (OPA/Sentinel), drift detection, state locking.

Bottlenecks: State contention, partial applies leaving environments in bad states, long-lived environments diverging from code.

Command: terraform plan in PR; terraform apply only from CI with state locking. Never apply from a local machine in prod.

Numbers: Enforce 1 writer per state file; PR review SLA < 24h for infra changes; drift MTTR < 1 business day.

Hiring signal: You bring up rollback strategy for infra (not just app code) and explain why it's harder than people assume.

Q3. Monitoring vs observability: metrics, logs, traces, what do you collect and why?

Difficulty: Senior

What it tests: How you debug under pressure. Metrics alone won't save you, interviewers want to hear how the three signals fit together.

Approach: Three signals: metrics (what's happening), logs (what happened), traces (where time went). Correlate everything via request/trace IDs. Instrument at the source, don't bolt it on later.

Key components: Collectors/agents, time-series DB, log pipeline, trace backend, dashboards, alert manager.

Bottlenecks: High-cardinality metrics exploding storage, log volume/cost at scale, trace sampling strategy under heavy load.

Command: Ensure trace_id/span_id flows through every log line. Keep dashboards readable at 3am.

Numbers: p95 dashboard query < 3s; traces sampled at 1–10% by default; hot log retention 7–14 days, cold 90–180 days.

Hiring signal: You define "observability" as "can you ask novel questions without deploying new code", not just "we have Grafana."

Q4. What is DevOps, and what problem does it actually solve in a real org?

Difficulty: Mid-level

What it tests: Whether you say "culture + systems" or just recite a buzzword definition. Interviewers are listening for operational maturity.

Approach: DevOps reduces handoff friction between build and run teams through shared ownership, automation, fast feedback loops, and safe releases. It's not a team name; it's a way of working.

Key components: CI/CD, IaC, monitoring, incident response, blameless postmortems, DORA metrics.

Bottlenecks: Coordination overhead, unowned systems, manual releases, weak test discipline, "DevOps team as gatekeeper" anti-pattern.

Numbers: Aim for daily deploys on mature services; incident MTTR target < 30 min; change failure rate < 15%.

Hiring signal: You mention DORA metrics (deploy frequency, lead time, change failure rate, MTTR) as the concrete way to measure maturity.
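
The DORA metrics mentioned above are easy to compute from a deploy log. A minimal sketch, assuming a hypothetical `Deploy` record per production change (field names are illustrative, not from any standard tool):

```python
from dataclasses import dataclass

@dataclass
class Deploy:
    day: int                 # day of the measurement window the deploy happened
    failed: bool             # did this change cause a failure in production?
    lead_time_hours: float   # commit-to-production time

def dora_metrics(deploys: list[Deploy], window_days: int = 28) -> dict:
    """Compute three of the four DORA metrics over a rolling window."""
    total = len(deploys)
    failures = sum(1 for d in deploys if d.failed)
    return {
        "deploys_per_day": total / window_days,
        "change_failure_rate": failures / total if total else 0.0,
        # upper median is fine for a sketch
        "median_lead_time_hours": (
            sorted(d.lead_time_hours for d in deploys)[total // 2] if total else None
        ),
    }
```

MTTR, the fourth metric, comes from incident records rather than deploy records, which is why it is omitted here.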

Q5. Walk me through a CI/CD pipeline you built (or would build) end-to-end.

Difficulty: Senior

What it tests: Whether you can ship changes repeatedly without heroics. Interviewers want specifics, not diagrams.

Approach: Trigger on PR → build/test/lint/scan → publish artifact → deploy to staging → verify (E2E + smoke) → promote same artifact to prod with progressive delivery + monitoring gates.

Key components: Runners, build caching, artifact registry, deployment controller, approval gates, rollback hooks, feature flags.

Bottlenecks: Flaky tests poisoning signal, slow builds killing velocity, shared runners causing queue congestion, environment drift between staging and prod.

Command: GitHub Actions YAML in .github/workflows/. Promote by digest and not tag, to guarantee staging == prod.

Numbers: Pipeline p95 < 15 min; unit tests < 5 min; staging soak 10–30 min before promotion; rollback < 2 min.

Hiring signal: You say "promote the same artifact, never rebuild for prod" as a non-negotiable principle.

Q6. Kubernetes fundamentals: architecture and core objects.

Difficulty: Senior

What it tests: Whether you can reason about production behavior and not just "K8s runs containers."

Approach: Control plane reconciles desired state. Nodes run workloads via kubelet. Core objects: Pod (unit of scheduling), Deployment (stateless replicas), Service (stable networking), ConfigMap/Secret (config injection), StatefulSet (sticky identity + storage).

Key components: API server, scheduler, etcd, controller manager; node components: kubelet, kube-proxy, container runtime.

Bottlenecks: etcd pressure at scale, mis-sized resource requests/limits causing evictions, noisy pods consuming node resources, cluster API rate limits.

Command: kubectl get pods -A and kubectl describe pod for events. kubectl logs -f --previous for crash loops.

Numbers: Production baseline: ≥3 nodes per AZ; target pod startup < 10s; set resource requests + limits on every container.

Hiring signal: You immediately ask "stateful or stateless?" and explain when you'd reach for StatefulSet vs Deployment.

Q7. Explain CI vs CD vs Continuous Deployment, what changes operationally?

Difficulty: Mid-level

What it tests: Whether you understand where risk moves when you automate each stage.

Approach: CI = integrate + test continuously (fast feedback on every commit). CD (Delivery) = always releasable, the decision to deploy is human. Continuous Deployment = prod deploy on every green build and the decision is automated.

Key components: Test strategy, artifact promotion gates, release approval policies, monitoring, automated rollback.

Bottlenecks: Fragile tests blocking CI, manual approvals bottlenecking delivery, poor observability making continuous deployment risky.

Numbers: Target change failure rate < 15%; delivery lead time < 1 day for mature pipelines; deployment frequency: daily or more.

Hiring signal: You explain that continuous deployment requires excellent observability and automated rollback before it's safe, not just a policy decision.

Q8. Deployment strategies: rolling vs blue/green vs canary, trade-offs and rollback.

Difficulty: Senior

What it tests: Risk management thinking. Interviewers are listening for judgment, not ideology about which strategy is "best."

Approach: Rolling = simplest, slow rollback. Blue/green = instant rollback by flipping traffic, costs double capacity. Canary = detect regressions with partial traffic (1–5%) before full rollout, most nuanced but highest operational overhead.

Key components: Traffic routing, health checks, automated rollback on error rate/latency spikes, schema compatibility, cache warmup.

Bottlenecks: Blue/green "split brain" data if schema changes aren't backward-compatible. Canary: session stickiness, cost of running extra fleet during soak.

Command: ECS blue/green uses CodeDeploy + target groups + listeners. K8s canary via Argo Rollouts or weighted Ingress rules.

Numbers: Start canary at 1–5% for 10–20 min; rollback trigger: error rate +0.5% over baseline; rollback execution < 2 min.

Hiring signal: You ask "what does the DB schema change look like?" before deciding on strategy and that's the voice of someone who's been burned.
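
The rollback trigger above ("error rate +0.5% over baseline") is simple enough to sketch directly. A hedged example, assuming error rates are expressed as fractions and the threshold matches the numbers in this section:

```python
def should_rollback(baseline_error_rate: float,
                    canary_error_rate: float,
                    threshold: float = 0.005) -> bool:
    """Trip an automated rollback when the canary's error rate exceeds
    the stable fleet's baseline by more than the threshold
    (0.5 percentage points by default, per the numbers above)."""
    return (canary_error_rate - baseline_error_rate) > threshold
```

Real canary controllers (Argo Rollouts, CodeDeploy) evaluate several metrics over a soak window, not a single point comparison, but the gating logic reduces to this shape.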

Q9. How do you run Terraform at scale, state, locking, modules, drift?

Difficulty: Senior

What it tests: Whether your Terraform usage survives more than one engineer and one environment.

Approach: One state file per boundary (service + environment). Remote backend with locking. Opinionated shared modules for common patterns. CI-driven plan/apply - no local applies to prod.

Key components: Remote state (S3 + DynamoDB or Terraform Cloud), state locking, module registry, workspace strategy, drift detection cron.

Bottlenecks: State lock contention on shared infra, drift from engineers making manual console changes, monolithic state files that take 10+ min to plan.

Command: terraform plan in PR pipeline; terraform apply only from protected CI job. Lock timeout 2-5 min to surface contention.

Numbers: Module reuse > 70% for common patterns; plan time < 3 min per boundary; drift check daily minimum.

Hiring signal: You proactively describe how to break up a monolithic state file and the blast radius of a bad apply without prompting.

Q10. Git branching strategy: trunk-based vs GitFlow, and why it matters for delivery.

Difficulty: Mid-level

What it tests: Whether your branching model enables or blocks delivery speed.

Approach: High-velocity teams: trunk-based + short-lived branches (< 2 days) + feature flags for in-progress work. Scheduled releases: GitFlow can work but adds merge pain and integration debt. Choose based on deploy cadence, not habit.

Key components: Protected branches, PR quality gates, semantic versioning tags, feature flags, merge queue.

Bottlenecks: Long-lived branches create integration hell. Rebases without discipline create confusing history. Feature flags add runtime complexity if not cleaned up.

Command: Prefer git revert for prod fix audit trails. Enforce linear history via merge queues on high-traffic repos.

Numbers: Keep branches alive < 1–2 days; merge queue latency < 30 min; deploy lead time target < 1 day.

Hiring signal: You mention feature flags as the key enabler for trunk-based development at scale and not just "we merge to main a lot."

Q11. Terraform vs Ansible vs CloudFormation - provisioning vs config management.

Difficulty: Senior

What it tests: Understanding of the "provision vs configure" boundary and where each tool's desired-state model breaks down.

Approach: Terraform/CloudFormation: provision and manage cloud resources (infra layer). Ansible: configure OS/apps on top of that infra (config layer). Both are code - review, version, and test them the same way.

Key components: Idempotency, drift control, inventory management, secrets handling, promotion process across environments.

Bottlenecks: Configuration drift on long-lived hosts, environment parity issues, Ansible playbooks that aren't idempotent causing chaos on re-runs.

Numbers: Config convergence target < 10 min; drift MTTR < 1 business day; no manual SSH changes to any managed host.

Hiring signal: You take a position on immutable vs mutable infrastructure and explain the operational implications of each and not just describe both.

Q12. DevSecOps: how do you secure a CI/CD pipeline and the software supply chain?

Difficulty: Senior

What it tests: "Security as a stage in delivery" thinking, not security bolted on after the fact.

Approach: Shift left: SAST, SCA (dependency scanning), and secrets scanning in every PR. Secure build environment (ephemeral runners, no persistent credentials). Sign and verify artifacts. Least-privilege deploy identities. Continuous post-deploy scanning.

Key components: Credential boundaries, artifact provenance/SBOM, registry admission policies, IaC scanning, runner security.

Bottlenecks: False positive fatigue causing engineers to ignore alerts, slow security gates killing velocity, secret sprawl across repos and CI variables.

Numbers: Pipeline secret rotation ≤ 90 days; critical CVE fix SLA < 7 days; zero hard-coded secrets in repos (enforce via pre-commit hooks).

Hiring signal: You define the software supply chain attack surface (build env, dependencies, registry, deploy pipeline) before being asked to.
Most DevOps candidates fail not because they don't know the tools, but because they can't explain the trade-offs. These 30 questions fix that.

Q13. Networking fundamentals: DNS, TLS, HTTP, reverse proxy vs load balancer.

Difficulty: Mid-level

What it tests: Whether you can debug "it's slow" without guessing. Networking is where most production mysteries live.

Approach: DNS resolves names to IPs (TTL matters for failover speed). TLS encrypts transport (handshake cost, cert rotation). Reverse proxy terminates TLS, caches, routes. LB distributes traffic at L4 (TCP) or L7 (HTTP), very different failure modes.

Key components: Timeouts, keep-alives, connection limits, health checks, TLS termination point, buffer sizing.

Bottlenecks: TLS handshake cost at high RPS, connection limit exhaustion, bad caching headers causing origin overload, DNS TTL too low causing resolution storms.

Command: Validate with dig (DNS), curl -v (TLS/HTTP headers), and LB health check dashboards. Check for missing timeouts first.

Numbers: p95 TLS handshake < 50ms in-region; HTTP error rate < 0.1% baseline; DNS TTL 30–300s depending on failover requirements.

Hiring signal: You say "check timeouts before anything else" as your first debugging instinct for latency issues.

Q14. GitOps: explain it and implement it with Argo CD or Flux in production.

Difficulty: Senior

What it tests: Whether you can keep environments consistent, auditable, and recoverable without manual kubectl apply.

Approach: Git is the single source of truth for desired cluster state. PR merges change desired state. An operator (Argo CD/Flux) continuously reconciles actual cluster state to the repo. Rollback = git revert + reconcile.

Key components: Repo structure (env overlays with Kustomize or Helm), sync policies, drift detection, PR approval gates, secret handling strategy.

Bottlenecks: Secret management in a GitOps model is non-trivial (sealed secrets, external secrets operator). Repo sprawl for large orgs. Auto-sync risk in prod without manual approval gates.

Command: Enforce PR approvals + sync windows for prod. argocd app sync --prune carefully, it deletes orphaned resources.

Numbers: Drift detection < 5 min; promotion PR cycle < 2 hours for routine changes; auto-sync to prod only with SLO monitoring gates.

Hiring signal: You immediately ask about the secrets strategy and that's the first real problem in every GitOps implementation.
Q15. Centralized logging: how do you collect, ship, and query logs at scale?

Difficulty: Senior

What it tests: Whether you can answer "what happened?" with evidence during an incident and not just "we have logs somewhere."

Approach: Standardize structured JSON logs at the source. Ship via lightweight agents. Index centrally with retention tiers. Enforce correlation IDs on every log line. Design for compliance (PII masking, access controls) from day one.

Key components: Log schema standards, correlation/trace IDs, PII redaction rules, retention tiers (hot/warm/cold), RBAC on log access.

Bottlenecks: Ingestion cost at scale, unbounded cardinality in log fields, noisy debug logs from chatty services, PII leaking into logs.

Numbers: Hot retention 7-14 days, warm 30-60 days, cold archive 180+ days. Query p95 < 5s for hot data. Sample debug logs aggressively.

Hiring signal: You talk about log schema standards and correlation IDs as day-one requirements, not nice-to-haves.
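
The "structured JSON logs with correlation IDs" requirement can be sketched with the stdlib alone. The field names below are illustrative, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the trace_id so
    logs can be correlated with traces and other services' logs."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # attached via `extra={"trace_id": ...}` or a logging filter
            "trace_id": getattr(record, "trace_id", None),
        })
```

In practice the trace_id would be injected automatically (e.g. via a logging filter reading request context) rather than passed by hand, so no log line can ship without it.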

Q16. Cloud fundamentals: which AWS/Azure/GCP primitives do you rely on daily, and why?

Difficulty: Mid-level

What it tests: Whether you can map architecture needs to managed services responsibly, without over-engineering or picking the wrong abstraction.

Approach: Anchor on a core set: compute (VMs/containers/functions), identity (IAM), networking (VPC/LB/DNS), storage (object/block/DB), observability (native + third-party), CI/CD. Understand the failure modes of each managed service you depend on.

Key components: IAM least privilege (most important), multi-AZ design, autoscaling policies, managed DB vs self-managed trade-offs.

Bottlenecks: Over-reliance on managed services without understanding SLAs, IAM complexity at scale, cross-region latency surprises.

Numbers: Multi-AZ as production baseline; new environment provisioning < 30 min via IaC; IAM role review quarterly.

Hiring signal: You have a clear answer to "multi-cloud: why or why not?" and it's based on operational cost, not FOMO.

Q17. Alerting design: how do you prevent alert storms and get actionable pages?

Difficulty: Senior

What it tests: Whether you understand how to protect on-call attention as a finite resource.

Approach: Alert on symptoms (user impact), not causes (CPU high). Route via Alertmanager with deduplication and grouping. Tie alerts to SLO burn rate, page when error budget is being consumed fast enough to matter.

Key components: Dashboards that survive a 3am wake-up, runbooks per alert, deduplication, silences, paging thresholds, escalation policy.

Bottlenecks: Per-instance alerts creating paging floods (the classic outage anti-pattern). Too many low-priority alerts training engineers to ignore pages.

Command: Prometheus alerting rules → Alertmanager for routing/silencing. Two-burn-rate alerts: fast burn (short window) + slow burn (long window).

Numbers: Page only if projected SLO burn > 2% of monthly budget in 1 hour (fast) or > 5% over 6 hours (slow). Review alert signal-to-noise quarterly.

Hiring signal: You describe the two-burn-rate alerting pattern from the Google SRE Workbook without being prompted.
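
The two-burn-rate pattern is mostly arithmetic. A minimal sketch, assuming a 30-day window and the thresholds from the SRE Workbook (2% of budget in 1 hour for fast burn, 5% over 6 hours for slow burn):

```python
HOURS_PER_MONTH = 30 * 24  # 720-hour rolling window

def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget."""
    return error_rate / (1.0 - slo)

def budget_consumed(error_rate: float, slo: float, window_hours: float) -> float:
    """Fraction of the monthly error budget consumed over a window."""
    return burn_rate(error_rate, slo) * window_hours / HOURS_PER_MONTH

def should_page(error_rate_1h: float, error_rate_6h: float,
                slo: float = 0.999) -> bool:
    """Page on fast burn (>2% of budget in 1h) or slow burn (>5% over 6h)."""
    fast = budget_consumed(error_rate_1h, slo, 1) > 0.02
    slow = budget_consumed(error_rate_6h, slo, 6) > 0.05
    return fast or slow
```

Prometheus expresses the same logic as recording rules over two range windows; the Python here just makes the arithmetic explicit.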

Q18. Incident response: walk me through handling a critical production outage.

Difficulty: Senior

What it tests: The senior-level separator. Can you lead calmly and systematically, or do you just troubleshoot in panic mode?

Approach: Triage (stop the bleeding, halt deploys, revert last change), communicate (incident channel + status page update), mitigate (rollback or isolate), then learn (blameless postmortem with action items and owners).

Key components: Incident commander role, shared communication channel, customer-facing status updates, timeline reconstruction, postmortem template with action items.

Bottlenecks: Lack of observability forcing guesswork, unclear ownership causing confusion, risky deploy without a rollback path, hero culture preventing proper postmortems.

Numbers: First status update within 10 min; mitigation target 30 min; full postmortem within 5 business days; every action item needs an owner and a due date.

Hiring signal: You say "first move is always halt deploys + revert last change — stop adding new variables" before even starting diagnosis.

Q19. High availability: designing for AZ and region failures with graceful degradation.

Difficulty: Senior

What it tests: Design thinking, interviewers want more than "add more replicas."

Approach: Multi-AZ as the baseline. Multi-region when RPO/RTO requirements justify the complexity. Remove single points of failure. Design for partial failure: run read-only mode if writes fail, queue non-critical work, shed load gracefully before full outage.

Key components: Load balancers, autoscaling groups, data replication (sync vs async trade-offs), health checks, failover automation.

Bottlenecks: Shared databases as hidden SPOFs, global state management, thundering herds on failover, split-brain risk with async replication.

Numbers: 99.9% availability = ~43 min downtime/month. p95 API latency < 300ms under normal load. Test failover quarterly and not just plan for it.

Hiring signal: You distinguish between eliminating SPOFs and designing for graceful degradation, and they're different problems with different solutions.
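
Graceful degradation is a decision function, not just architecture. A sketch of priority-based load shedding, with illustrative thresholds (the priority names and cutoffs are assumptions, not from any framework):

```python
def admit(request_priority: str, utilization: float) -> bool:
    """Shed load progressively instead of failing everything at once:
    drop background work first, then normal traffic, and keep serving
    critical requests until the system is truly saturated.
    Thresholds are illustrative, not prescriptive."""
    thresholds = {"critical": 0.98, "normal": 0.85, "background": 0.60}
    return utilization < thresholds.get(request_priority, 0.0)
```

The point interviewers listen for: degradation is designed in tiers ahead of time, not improvised during the outage.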

Q20. Secrets management: store, inject, rotate, and audit secrets at scale.

Difficulty: Senior

What it tests: Whether you treat secrets as radioactive material, or as config that happens to be sensitive.

Approach: Central secret store (Vault, AWS Secrets Manager, GCP Secret Manager). Short-lived credentials where possible (dynamic secrets, OIDC). Inject at runtime, never bake into images or check into repos. Full audit log of all access.

Key components: RBAC with least privilege, automated rotation, break-glass access procedures, audit logs, pre-commit hooks to catch accidental commits.

Bottlenecks: Secret sprawl across repos and CI env vars, manual rotation processes that get missed, leaked tokens with no detection.

Numbers: Rotate high-privilege tokens every 30–90 days; incident response for a leaked secret starts < 15 min (revoke first, investigate second).

Hiring signal: You say "revoke first, ask questions later" for any suspected secret leak, with no hesitation.
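
The pre-commit hooks mentioned above boil down to pattern matching. A deliberately tiny sketch, and only a sketch: real scanners (gitleaks, trufflehog) ship far more complete rule sets:

```python
import re

# Two well-known shapes; a real rule set has dozens.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # private key material
]

def find_secrets(text: str) -> list[str]:
    """Return any lines that look like committed secrets."""
    return [line for line in text.splitlines()
            if any(p.search(line) for p in SECRET_PATTERNS)]
```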

Q21. DevOps vs Agile: how are they different, and how do they complement each other?

Difficulty: Mid-level

What it tests: Whether you can separate "how we build" from "how we run and deploy."

Approach: Agile optimizes iterative development, small batches, fast feedback, continuous improvement of the product. DevOps extends that feedback loop through deployment and operations - automation, shared ownership, fast recovery.

Key components: Sprint cadence, feedback loops, CI/CD connecting the two, monitoring closing the loop back into product decisions.

Numbers: Sprint cadence 1-2 weeks; deployment cadence should ideally match or exceed sprint cadence for high-performing teams.

Hiring signal: You explain the "DevOps team as gatekeepers" anti-pattern, where a separate DevOps team creates the same wall that DevOps was meant to remove.

Q22. Disaster recovery: define RTO/RPO and outline a DR strategy you'd actually test.

Difficulty: Senior

What it tests: Whether you can recover when the cloud, or your own team, has a bad day. Untested DR is not DR.

Approach: Define business-aligned RTO (how long can we be down?) and RPO (how much data can we lose?). Pick DR pattern: backup/restore (cheapest, slowest), warm standby (middle), active/active (most resilient, most expensive). Test quarterly and not annually.

Key components: Automated backups with integrity validation, data replication, runbooks, failover drills with success criteria, data consistency checks post-restore.

Command: Automate restore tests and validate data integrity, not just "job succeeded." Actually fail a component in staging as a drill.

Numbers: DR test quarterly minimum; RPO < 15 min for critical data; RTO < 1 hour for tier-1 services.

Hiring signal: You say "an untested backup is not a backup" and describe how you verify data integrity post-restore, not just completion status.
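
Post-restore integrity checking can be as simple as comparing content checksums between source and restored datasets. A hedged sketch (rows modeled as tuples for illustration; a real pipeline would checksum table exports or object-store manifests):

```python
import hashlib

def checksum(rows: list[tuple]) -> str:
    """Order-independent checksum of a dataset, so the source and the
    restored copy can be compared even if row order differs."""
    digest = hashlib.sha256()
    for row in sorted(repr(r).encode() for r in rows):
        digest.update(row)
    return digest.hexdigest()

def restore_is_valid(source_rows: list[tuple], restored_rows: list[tuple]) -> bool:
    """'Backup job succeeded' is not enough: compare actual content."""
    return checksum(source_rows) == checksum(restored_rows)
```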

Q23. Cost and capacity: forecast usage, set autoscaling, and control spend.

Difficulty: Senior

What it tests: Senior DevOps owns reliability AND efficiency. Waste is an outage in slow motion.

Approach: Use SLOs + load patterns to size the baseline. Autoscale for peaks. Track unit cost per request or per user. Rightsize continuously, over-provisioning is a tax on every team.

Key components: Budget alerts and hard limits, resource tagging for cost attribution, reserved/spot instance strategy, HPA/ASG policies with sensible min/max bounds.

Bottlenecks: Autoscaling runaway from a bug causing infinite scaling, cost allocation without tagging becoming a black box, surprise egress bills.

Numbers: Target compute utilization 40-60% at baseline; cost review monthly; rightsize target 10-20% reduction per quarter in mature estates.

Hiring signal: You frame cost as an engineering constraint and not a finance problem. "Cost per request" is your primary unit, not total spend.
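
Two small calculations back up the "cost per request" framing and the 40-60% utilization target. A sketch with assumed inputs (per-replica capacity and target utilization are illustrative):

```python
import math

def cost_per_million_requests(monthly_spend: float, monthly_requests: int) -> float:
    """The unit cost a senior engineer tracks instead of total spend:
    spend can grow while unit cost falls, and that's fine."""
    return monthly_spend / (monthly_requests / 1_000_000)

def baseline_replicas(peak_rps: float, per_replica_rps: float,
                      target_utilization: float = 0.5) -> int:
    """Size the baseline so peak load lands near the utilization
    target, leaving headroom for autoscaling to absorb spikes."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))
```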

Q24. Kubernetes troubleshooting: a pod is CrashLoopBackOff, what's your checklist?

Difficulty: Senior

What it tests: Whether your debugging is systematic and calm, not just vibes and random log-scrolling.

Approach: Step 1: kubectl describe pod and read the Events section. Step 2: kubectl logs -f --previous to see the actual crash. Step 3: check config/secrets/env vars. Step 4: check resource limits (OOMKilled?). Step 5: check dependencies (DB, DNS, downstream services).

Key components: Readiness/liveness probe misconfiguration, image pull failures, missing env vars, mount failures, resource limits too tight, upstream dependency down.

Command: kubectl describe pod <n> -n <ns>; kubectl logs <n> --previous; kubectl get events --sort-by=.metadata.creationTimestamp

Numbers: Triage within 5 min with a clear runbook. Keep per-service runbooks that cover the top 5 crash causes for that service.

Hiring signal: You check resource limits (OOMKilled) within the first 3 steps — most engineers scroll logs for 10 minutes before checking the obvious.

Q25. SLI/SLO/SLA and error budgets: translating reliability into engineering decisions.

Difficulty: Senior

What it tests: Whether you can turn reliability into a concrete engineering contract that drives actual decisions and not just a monitoring dashboard.

Approach: Pick user-centric SLIs (latency, availability, error rate). Set realistic SLOs based on business need and measurement capability. Derive error budget (the allowed downtime/errors). Use burn rate to decide when to pause new feature launches and focus on reliability.

Key components: Measurement window (rolling 28 days), error budget policy (who decides to stop launches), burn rate alerts, SLO review cadence.

Numbers: 99.9% monthly availability = ~43 min error budget. Fast burn alert: >2% per hour. Slow burn alert: >5% over 6 hours. Review SLOs quarterly.

Hiring signal: You explain that an error budget makes reliability a shared conversation between product and engineering and not just an ops metric.
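
The budget arithmetic in this section is worth being able to do on a whiteboard. A sketch, assuming a 30-day window and a simple "freeze launches when the budget is gone" policy (real policies are usually more graduated):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day window

def error_budget_seconds(slo: float) -> float:
    """Allowed downtime per 30-day window. 99.9% -> ~43 minutes."""
    return (1.0 - slo) * SECONDS_PER_MONTH

def freeze_launches(downtime_seconds: float, slo: float = 0.999) -> bool:
    """A blunt error-budget policy: pause feature launches once the
    monthly budget is exhausted, until reliability work pays it back."""
    return downtime_seconds >= error_budget_seconds(slo)
```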

Q26. Artifact management and versioning: tag, store, and promote builds reliably.

Difficulty: Mid-level

What it tests: Release discipline. Can you reproduce exactly what's running in production from a commit hash six months later?

Approach: Build once, store immutably, promote the same artifact across all environments. Tag with commit SHA + semantic version. Pin image by digest in deploy manifests, never by mutable tag like "latest."

Key components: Artifact registry, SBOM/signing, semantic versioning, build metadata (commit, pipeline run), retention and cleanup policy.

Command: In K8s manifests: image: registry/app@sha256:abc123 and never image: registry/app:latest.

Numbers: Artifact retention minimum 90 days; rollback to any previous version < 5 min for stateless services.

Hiring signal: You say "never rebuild for prod, promote the exact artifact that passed staging" as an immovable principle.
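
Digest pinning is easy to enforce mechanically in CI. A sketch of the check (the regex is a simplification of the OCI reference grammar, sufficient for a lint gate):

```python
import re

# Accepts only registry/name@sha256:<64 hex chars>; any tag-based
# reference, including ':latest', fails.
DIGEST_REF = re.compile(r"^[\w./-]+@sha256:[0-9a-f]{64}$")

def is_pinned(image_ref: str) -> bool:
    """True only for digest-pinned image references."""
    return bool(DIGEST_REF.match(image_ref))
```

Run it over every `image:` field in rendered manifests during the PR pipeline, and mutable tags never reach a cluster.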

Q27. Database migrations in DevOps: zero-downtime schema changes.

Difficulty: Senior

What it tests: Whether you can ship DB changes without taking the site down or creating a 2am emergency.

Approach: Expand/contract pattern. Phase 1: add new column/table (backward-compatible), deploy app that reads both old and new. Phase 2: migrate data. Phase 3: remove old structure in a later separate deploy. Never combine a schema change and an application change in a single deploy.

Key components: Expand/contract pattern, feature flags to control cut-over, idempotent migration tooling (Flyway/Liquibase), migration verification step, rollback posture.

Command: Run migrations as a separate pre-deploy step. Enforce lock timeouts < 1s to prevent table-lock outages on busy tables.

Numbers: Each migration phase < 5 min in prod; lock timeout < 1s; large backfills run as async background jobs, not migration scripts.

Hiring signal: You describe the expand/contract pattern unprompted and explain why you never combine a schema change and an application change in one deploy.

Q28. Reliability patterns: retries, backoff, idempotency, circuit breakers.

Difficulty: Senior

What it tests: Distributed systems instincts and how you prevent small failures from cascading into large ones.

Approach: Timeouts first (always, everything needs a timeout). Retries with jittered exponential backoff for transient errors. Idempotency keys for safe repeats on non-idempotent operations. Circuit breakers when a dependency is consistently flapping. Bulkheads to isolate blast radius.

Key components: Client-side timeouts, retry budget, jitter, idempotency key storage, circuit breaker thresholds, bulkhead thread pools/queues.

Bottlenecks: Retries without jitter causing synchronized retry storms. Missing idempotency causing duplicate charges/orders. Circuit breakers misconfigured to trip on normal variance.

Numbers: Exponential backoff starting at 100ms with ±50% jitter; max 3–5 retries; client timeout < p99 latency + margin.

Hiring signal: You explain when retries make things worse (non-idempotent operations, cascading overload) and not just when they help.
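
The backoff numbers above translate directly into code. A minimal sketch (the injectable `sleep` parameter is an assumption for testability, not a standard API; `op` must be idempotent, or guarded by an idempotency key, before it is safe to retry at all):

```python
import random
import time

def retry_with_backoff(op, *, retries: int = 4, base_delay: float = 0.1,
                       sleep=time.sleep):
    """Jittered exponential backoff: 100ms base, doubling per attempt,
    +/-50% jitter so a fleet of clients doesn't retry in lockstep."""
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception:
            if attempt == retries:
                raise  # retry budget exhausted: surface the failure
            delay = base_delay * (2 ** attempt)
            jitter = delay * random.uniform(-0.5, 0.5)
            sleep(delay + jitter)
```

Note what's deliberately missing: no retry on non-idempotent operations without a key, and a hard cap on attempts so retries can't amplify an outage.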

Q29. Kubernetes security: RBAC, network policies, and admission controls.

Difficulty: Senior

What it tests: Whether you can keep a cluster safe without blocking engineers from doing their jobs.

Approach: Least-privilege RBAC (no cluster-admin for app teams). Namespace isolation + network policies to restrict lateral movement. Admission controllers (OPA/Gatekeeper or Kyverno) to enforce policies at deploy time and not discovery time. Image scanning in registry and at admission.

Key components: Service account per workload (not default), sealed/external secrets, network policies for ingress + egress, pod security standards, approved image registries only.

Command: Deny privileged pods, require resource limits, restrict image registries, enforce at admission so engineers get fast feedback, not a prod incident.

Numbers: Zero cluster-admin bindings for app teams; rotate service tokens ≤ 90 days; audit RBAC bindings monthly.

Hiring signal: You mention pod security standards (restricted/baseline/privileged) and why you enforce them at admission, not discovery.

Q30. Tool selection and adoption: evaluating tools and driving standardization.

Difficulty: Senior

What it tests: Senior judgment, can you pick tools that survive scale, organizational politics, and the next engineer who joins the team?

Approach: Start from requirements (scale, compliance, developer UX, operational burden). Run a time-boxed pilot (2-4 weeks) with real workloads. Measure outcomes against baseline. Standardize via paved paths with good defaults, and not just mandates. Plan for migration and deprecation from day one.

Key components: Pilot success criteria, migration plan, training/documentation, ownership model, SLA for the platform team, deprecation policy.

Bottlenecks: Tool sprawl from "we allow everything." Forced standardization killing velocity by not fitting team needs. No deprecation path leaving zombie tools in prod forever.

Numbers: Pilot timeline 2-4 weeks; adoption target > 60% in first quarter if it's genuinely better; revisit tool choices annually.

Hiring signal: You use the phrase "paved path" (good defaults that make the right thing easy) while still allowing exceptions with justification.

Before your interview

DevOps Glossary Quick Brush-Up

Scan this before your interview. These are the terms candidates most often stumble on.

Fundamentals

DevOps
Culture and automation that remove friction between building and running software.
SRE
Google’s prescriptive implementation of DevOps. Engineers apply software thinking to ops problems.
DORA metrics
Deploy frequency, lead time, change failure rate, and MTTR.
Toil
Manual, repetitive ops work that scales badly and should be automated away.
Blameless postmortem
Incident review focused on system failures, not personal blame.
Shift left
Move testing and security earlier into the PR and build process.
Paved path
Opinionated defaults that make the right engineering choice easier.
Immutable infrastructure
Servers are replaced with new versions instead of patched in place.

CI/CD

CI
Automatically build and test every commit for fast feedback.
CD
Every build is releasable. A human still decides when to deploy.
Continuous deployment
Every green build automatically ships to production.
Pipeline
Automated chain of steps: build, test, scan, publish, deploy.
Artifact
Immutable build output such as a container image or binary.
Flaky test
A non-deterministic test that destroys confidence in the pipeline.
Feature flag
Runtime toggle that separates deploy from release.
Rollback
Return to a previous known-good version after a bad deploy.

Infrastructure as Code

IaC
Managing infrastructure through code instead of manual console work.
Terraform
Declarative IaC tool used to manage cloud resources.
Ansible
Agentless config management tool for OS and app setup.
State file
Terraform’s record of what it believes exists in the real world.
State locking
Prevents multiple engineers from applying infra changes at once.
Drift
When real infrastructure no longer matches the code definition.
Idempotent
Running the same operation repeatedly gives the same result.
Module
Reusable versioned IaC building block, similar to a function.

Containers & Kubernetes

Container
Isolated process sharing the host kernel. Lightweight and fast to start.
Image
Immutable layered snapshot. A container is a running instance of an image.
Pod
Smallest schedulable unit in Kubernetes.
Deployment
Kubernetes object for stateless replicas with rolling updates.
StatefulSet
Kubernetes object for stateful workloads with stable identity and storage.
Service
Stable network endpoint for a set of pods.
Ingress
Kubernetes object for external HTTP and HTTPS routing.
CrashLoopBackOff
A pod that keeps crashing and restarting repeatedly.
OOMKilled
Container killed because it exceeded its memory limit.
etcd
Distributed key-value store holding Kubernetes cluster state.

Observability

Metrics
Numeric measurements over time such as latency, error rate, or CPU.
Logs
Timestamped records of events, ideally structured and queryable.
Traces
End-to-end record of a request flowing through services.
Span
One unit of work inside a distributed trace.
Correlation ID
Unique ID carried across services to connect logs and traces.
SLI
Measured reliability metric such as p99 latency or error rate.
SLO
Target reliability objective the service aims to meet.
Error budget
Allowed downtime or errors before the SLO is breached.
Burn rate
How quickly the service is consuming its error budget.
Alert fatigue
Too many low-signal alerts causing engineers to ignore pages.
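
The SLI, SLO, error budget, and burn rate terms above compose into simple arithmetic worth being able to do on a whiteboard. A sketch with illustrative numbers (function names are made up for this example):

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime in the window for an availability SLO."""
    return window_days * 24 * 60 * (1 - slo_target)

def burn_rate(observed_error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 consumes the whole budget in exactly one window;
    sustained high burn rates are what SLO-based alerts page on.
    """
    return observed_error_rate / (1 - slo_target)
```

For example, a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime, and a 1% error rate against that SLO is a burn rate of about 10: the monthly budget would be gone in about three days.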

Deployment, Reliability & Security

Rolling deploy
Replace instances gradually. Simple but slower to roll back.
Blue/green deploy
Two environments, instant traffic switch, fast rollback, higher cost.
Canary deploy
Send a small share of traffic to the new version first.
Circuit breaker
Stops calls to a failing dependency to prevent cascade failures.
Retry with backoff
Retry transient failures with increasing delays and jitter.
Idempotency key
Token that makes a non-idempotent operation safe to retry.
RTO
Maximum acceptable recovery time after an outage.
RPO
Maximum acceptable data loss window measured in time.
Chaos engineering
Injecting failures deliberately to test resilience.
Least privilege
Grant only the minimum permissions required.
Secret sprawl
Credentials scattered across repos, CI vars, and config files.
SAST
Static code scanning for vulnerabilities during build time.
SCA
Dependency scanning for known CVEs and supply chain risk.
SBOM
Inventory of all software components included in a build.
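
Candidates often conflate the circuit breaker above with retry-with-backoff; they are complementary. A minimal, illustrative sketch of the breaker half (class and threshold names are hypothetical, and production systems would use a library rather than this):

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; after
    `reset_timeout` seconds, allow one trial call (half-open)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0      # success closes the circuit
        self.opened_at = None
        return result
```

The key contrast: retries add load to a struggling dependency, while an open breaker sheds load from it. Mature systems use both, with the breaker sitting outside the retry loop.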

Frequently Asked Questions About DevOps Interviews

What do DevOps interview questions actually test?

At mid-to-senior level, DevOps interviews are usually testing four things: operational judgment, change safety, systems thinking, and production realism. Interviewers want to know whether you can ship safely, automate repetitive work, run infrastructure at scale, and stay calm when production breaks.

How should I structure a strong DevOps interview answer?

A strong answer usually follows a simple structure: explain what problem the tool or pattern is solving, describe how you would approach it, call out failure modes, add one or two concrete numbers or targets, and then explain what changes at 10x scale.

What are the most common topics in DevOps interviews?

The most common topics are CI/CD pipelines, Infrastructure as Code, Docker, Kubernetes, observability, deployment strategies, incident response, secrets management, and reliability concepts like SLOs, error budgets, retries, and rollback design.

What is the difference between CI, CD, and continuous deployment?

CI means every commit is automatically built and tested. Continuous Delivery means every build is releasable, but a human still decides when to deploy. Continuous Deployment means every green build automatically goes to production with no human gate.

What is a good answer to “What is DevOps?” in an interview?

A good answer goes beyond buzzwords. DevOps is a way of reducing friction between building and running software through shared ownership, automation, fast feedback, and safer releases. Strong candidates usually connect that definition to outcomes like deploy frequency, MTTR, and change failure rate.

Do I need Kubernetes knowledge for a DevOps interview?

For many modern DevOps and SRE roles, yes. You should be comfortable with core Kubernetes concepts like Pods, Deployments, Services, StatefulSets, Ingress, troubleshooting CrashLoopBackOff errors, and understanding how the control plane and nodes interact.

What mistakes cause candidates to lose points in DevOps interviews?

The most common mistakes are giving tool definitions without trade-offs, ignoring rollback and failure modes, speaking in vague terms without numbers, describing manual production processes as acceptable, and failing to connect security, observability, or incident response back to delivery risk.

How should I practice for a DevOps interview?

Pick five to ten high-frequency questions and answer them out loud. Practice explaining pipelines, deploy strategies, IaC safety, Kubernetes troubleshooting, and incident response using real examples. The best answers sound like someone who has actually operated systems, not someone reciting a certification guide.

What are SLOs and error budgets, and why do interviewers ask about them?

SLOs are reliability targets for a service, such as availability or latency goals. An error budget is the amount of failure you are allowed before breaching that target. Interviewers ask about them because they reveal whether you understand how reliability should guide release decisions, alerting, and engineering prioritization.

What DevOps interview questions come up most often?

Common recurring questions include Docker vs VMs, CI/CD pipeline design, Terraform safety, Kubernetes basics, rolling vs blue-green vs canary deploys, observability, incident response, secrets management, GitOps, disaster recovery, and SLI/SLO/error budget design.